Read from and write to Web ARChive (WARC) files

Ishaan Aggarwal (ishaan.aggarwal@sjsu.edu)

Purpose:

Web ARChive (.warc) files aggregate multiple web pages into a single file, usually in compressed form. Web crawlers use them to store crawl data as a sequence of blocks (records). Combined with their index files (.cdx), they make it possible to jump directly to the offset in the file where a relevant record is stored, without decompressing the whole file. The Yioop! search engine stores its crawl data in .warc format, which makes this deliverable a useful tool for reading from and writing to WARC files.
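
To illustrate why an index helps, here is a minimal sketch, using only the Rust standard library, of jumping straight to a record given its byte offset and length. The `IndexEntry` struct and the `read_at` function are hypothetical names chosen for illustration; they are not part of the CDX format or of this deliverable, and a real .cdx line carries more fields than the two assumed here.

```rust
use std::io::{self, Read, Seek, SeekFrom};

/// Hypothetical, simplified index entry: where a record starts and how long it is.
pub struct IndexEntry {
    pub offset: u64,
    pub length: usize,
}

/// Seek to the entry's offset and read exactly `length` bytes,
/// without reading anything stored before that offset.
pub fn read_at<R: Read + Seek>(source: &mut R, entry: &IndexEntry) -> io::Result<Vec<u8>> {
    source.seek(SeekFrom::Start(entry.offset))?;
    let mut buf = vec![0u8; entry.length];
    // `read_exact` fails with an error if fewer than `length` bytes remain.
    source.read_exact(&mut buf)?;
    Ok(buf)
}
```

Any `Read + Seek` source works here, for example a `std::fs::File` or an in-memory `std::io::Cursor`.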



Implementation Overview:

Project Setup:

The project has two parts. One is the source folder, which contains the implementation of the methods for reading and writing WARC files. The other is the code that uses this source to read and write WARC files.


Cargo.toml file for the library

[package]
name = "warc"
version = "0.3.0"
authors = ["ISHAAN-PC"]
documentation = "https://docs.rs/crate/warc/"
edition = "2018"

[dependencies]
chrono = "0.4.11"
nom = "5.1.1"
uuid = { version = "0.8.1", features = ["v4"] }

[dependencies.libflate]
version = "1"
optional = true

[features]
default = ["gzip"]
gzip = ["libflate"]

How to run:

To run this project, run "cargo build" in the parent directory of the project. This builds the library. Make sure that a WARC file named webcrawl-1.warc is present in the same directory, then run "cargo run". The WARC file is read and its contents are printed to the terminal.

Code: warc-reader-writer library

The library consists of 9 files, with functions grouped by the part of the operation they handle, as the file names indicate.


lib.rs


//! A WARC (Web ARChive) library

mod error;
pub use error::Error;

mod warc_reader;
pub use warc_reader::WarcReader;
mod warc_writer;
pub use warc_writer::WarcWriter;

pub mod header;

pub mod parser;

mod record;
pub use record::{RawRecord, Record, RecordBuilder};

mod record_type;
pub use record_type::RecordType;

mod truncated_type;
pub use truncated_type::TruncatedType;

warc_reader.rs


use crate::parser;
use crate::{Error, RawRecord};

use std::fs;
use std::io;
use std::io::{BufRead, BufReader};
use std::path::Path;

#[cfg(feature = "gzip")]
use libflate::gzip::Decoder as GzipReader;

const KB: usize = 1_024;
const MB: usize = 1_048_576;

/// A reader which iteratively parses WARC records from a stream.
pub struct WarcReader<R> {
    reader: R,
}

impl<R: BufRead> WarcReader<R> {
    /// Create a new reader.
    pub fn new(r: R) -> Self {
        WarcReader { reader: r }
    }
}

impl WarcReader<BufReader<fs::File>> {
    /// Create a new reader which reads from file.
    pub fn from_path<P: AsRef<Path>>(path: P) -> io::Result<Self> {
        // Open read-only so that a missing file is reported as an error
        // instead of being silently created as an empty archive.
        let file = fs::File::open(&path)?;
        let reader = BufReader::with_capacity(MB, file);

        Ok(WarcReader::new(reader))
    }
}

#[cfg(feature = "gzip")]
impl WarcReader<BufReader<GzipReader<std::fs::File>>> {
    /// Create a new reader which reads from a compressed file.
    ///
    /// Only GZIP compression is currently supported.
    pub fn from_path_gzip<P: AsRef<Path>>(path: P) -> io::Result<Self> {
        // Open read-only so that a missing file is reported as an error
        // instead of being silently created as an empty archive.
        let file = fs::File::open(&path)?;
        let gzip_stream = GzipReader::new(file)?;
        let reader = BufReader::with_capacity(MB, gzip_stream);

        Ok(WarcReader::new(reader))
    }
}

impl<R: BufRead> Iterator for WarcReader<R> {
    type Item = Result<RawRecord, Error>;

    fn next(&mut self) -> Option<Self::Item> {
        let mut header_buffer: Vec<u8> = Vec::with_capacity(64 * KB);
        let mut found_headers = false;
        while !found_headers {
            let bytes_read = match self.reader.read_until(b'\n', &mut header_buffer) {
                Err(_) => return Some(Err(Error::ReadData)),
                Ok(len) => len,
            };

            if bytes_read == 0 {
                return None;
            }

            if bytes_read == 2 {
                let last_two_chars = header_buffer.len() - 2;
                if &header_buffer[last_two_chars..] == b"\r\n" {
                    found_headers = true;
                }
            }
        }

        let headers_parsed = match parser::headers(&header_buffer) {
            Err(_) => return Some(Err(Error::ParseHeaders)),
            Ok(parsed) => parsed.1,
        };
        let version_ref = headers_parsed.0;
        let headers_ref = headers_parsed.1;
        let expected_body_len = headers_parsed.2;

        let mut body_buffer: Vec<u8> = Vec::with_capacity(1 * MB);
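        // Always enter the loop: even an empty body is followed by \r\n\r\n,
        // which must be consumed before the next record's headers begin.
        let mut found_body = false;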
        let mut body_bytes_read = 0;
        let maximum_read_range = expected_body_len + 4;
        while !found_body {
            let bytes_read = match self.reader.read_until(b'\n', &mut body_buffer) {
                Err(_) => return Some(Err(Error::ReadData)),
                Ok(len) => len,
            };

            body_bytes_read += bytes_read;

            // we expect 4 characters (\r\n\r\n) after the body
            if bytes_read == 2 && body_bytes_read == maximum_read_range {
                found_body = true;
            }

            if bytes_read == 0 {
                return Some(Err(Error::UnexpectedEOB));
            }

            if body_bytes_read > maximum_read_range {
                return Some(Err(Error::ReadOverflow));
            }
        }

        let body_ref = &body_buffer[..expected_body_len];

        let record = RawRecord {
            version: version_ref.to_owned(),
            headers: headers_ref
                .into_iter()
                .map(|(token, value)| (token.into(), value.to_owned()))
                .collect(),
            body: body_ref.to_owned(),
        };
        Some(Ok(record))
    }
}

#[cfg(test)]
mod tests {
    use std::collections::HashMap;
    use std::io::{BufReader, Cursor};
    use std::iter::FromIterator;

    use crate::{header::WarcHeader, WarcReader};
    macro_rules! create_reader {
        ($raw:expr) => {{
            BufReader::new(Cursor::new($raw.get(..).unwrap()))
        }};
    }

    #[test]
    fn basic_record() {
        let raw = b"\
            WARC/1.0\r\n\
            Warc-Type: dunno\r\n\
            Content-Length: 5\r\n\
            WARC-Record-Id: <urn:test:basic-record:record-0>\r\n\
            WARC-Date: 2020-07-08T02:52:55Z\r\n\
            \r\n\
            12345\r\n\
            \r\n\
        ";

        let expected_version = "1.0";
        let expected_headers: HashMap<WarcHeader, Vec<u8>> = HashMap::from_iter(
            vec![
                (WarcHeader::WarcType, b"dunno".to_vec()),
                (WarcHeader::ContentLength, b"5".to_vec()),
                (
                    WarcHeader::RecordID,
                    b"<urn:test:basic-record:record-0>".to_vec(),
                ),
                (WarcHeader::Date, b"2020-07-08T02:52:55Z".to_vec()),
            ]
            .into_iter(),
        );
        let expected_body: &[u8] = b"12345";

        let mut reader = WarcReader::new(create_reader!(raw));
        let record = reader.next().unwrap().unwrap();
        assert_eq!(record.version, expected_version);
        assert_eq!(record.headers, expected_headers);
        assert_eq!(record.body, expected_body);
    }

    #[test]
    fn two_records() {
        let raw = b"\
            WARC/1.0\r\n\
            Warc-Type: dunno\r\n\
            Content-Length: 5\r\n\
            WARC-Record-Id: <urn:test:two-records:record-0>\r\n\
            WARC-Date: 2020-07-08T02:52:55Z\r\n\
            \r\n\
            12345\r\n\
            \r\n\
            WARC/1.0\r\n\
            Warc-Type: another\r\n\
            WARC-Record-Id: <urn:test:two-records:record-1>\r\n\
            WARC-Date: 2020-07-08T02:52:56Z\r\n\
            Content-Length: 6\r\n\
            \r\n\
            123456\r\n\
            \r\n\
        ";

        let mut reader = WarcReader::new(create_reader!(raw));
        {
            let expected_version = "1.0";
            let expected_headers: HashMap<WarcHeader, Vec<u8>> = HashMap::from_iter(
                vec![
                    (WarcHeader::WarcType, b"dunno".to_vec()),
                    (WarcHeader::ContentLength, b"5".to_vec()),
                    (
                        WarcHeader::RecordID,
                        b"<urn:test:two-records:record-0>".to_vec(),
                    ),
                    (WarcHeader::Date, b"2020-07-08T02:52:55Z".to_vec()),
                ]
                .into_iter(),
            );
            let expected_body: &[u8] = b"12345";

            let record = reader.next().unwrap().unwrap();
            assert_eq!(record.version, expected_version);
            assert_eq!(record.headers, expected_headers);
            assert_eq!(record.body, expected_body);
        }

        {
            let expected_version = "1.0";
            let expected_headers: HashMap<WarcHeader, Vec<u8>> = HashMap::from_iter(
                vec![
                    (WarcHeader::WarcType, b"another".to_vec()),
                    (WarcHeader::ContentLength, b"6".to_vec()),
                    (
                        WarcHeader::RecordID,
                        b"<urn:test:two-records:record-1>".to_vec(),
                    ),
                    (WarcHeader::Date, b"2020-07-08T02:52:56Z".to_vec()),
                ]
                .into_iter(),
            );
            let expected_body: &[u8] = b"123456";

            let record = reader.next().unwrap().unwrap();
            assert_eq!(record.version, expected_version);
            assert_eq!(record.headers, expected_headers);
            assert_eq!(record.body, expected_body);
        }
    }
}
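
The framing that `next()` relies on can be sketched independently of the library, using only the standard library: header lines end with CRLF, an empty line closes the header block, Content-Length gives the body size, and two CRLF pairs follow the body. The function `read_one` below is a hypothetical, simplified stand-in and not the crate's parser: it only extracts Content-Length from the headers, and it reads the body with `read_exact` instead of line by line.

```rust
use std::io::{self, BufRead, Read};

/// Read one record and return (version line, body).
/// Returns Ok(None) on a clean end of input.
pub fn read_one<R: BufRead>(reader: &mut R) -> io::Result<Option<(String, Vec<u8>)>> {
    let bad = |msg: &str| io::Error::new(io::ErrorKind::InvalidData, msg.to_string());

    let mut version = String::new();
    if reader.read_line(&mut version)? == 0 {
        return Ok(None); // no more records
    }
    let version = version.trim_end().to_string();

    // Header lines run until the first empty line.
    let mut content_length: Option<usize> = None;
    loop {
        let mut line = String::new();
        if reader.read_line(&mut line)? == 0 {
            return Err(bad("input ended inside the header block"));
        }
        let line = line.trim_end();
        if line.is_empty() {
            break;
        }
        if let Some((name, value)) = line.split_once(':') {
            if name.eq_ignore_ascii_case("content-length") {
                let parsed = value
                    .trim()
                    .parse::<usize>()
                    .map_err(|_| bad("Content-Length is not an integer"))?;
                content_length = Some(parsed);
            }
        }
    }
    let len = content_length.ok_or_else(|| bad("missing Content-Length"))?;

    // The body is exactly `len` bytes, followed by "\r\n\r\n".
    let mut body = vec![0u8; len];
    reader.read_exact(&mut body)?;
    let mut trailer = [0u8; 4];
    reader.read_exact(&mut trailer)?;
    if &trailer != b"\r\n\r\n" {
        return Err(bad("record not terminated by two CRLF pairs"));
    }
    Ok(Some((version, body)))
}
```

Because the trailer is always consumed, a record with an empty body leaves the stream positioned at the start of the next record.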

warc_writer.rs


use crate::{RawRecord, Record};

use std::fs;
use std::io;
use std::io::{BufWriter, Write};
use std::path::Path;

#[cfg(feature = "gzip")]
use libflate::gzip::Encoder as GzipWriter;

const MB: usize = 1_048_576;

/// A writer which writes records to an output stream.
pub struct WarcWriter<W> {
    writer: W,
}

impl<W: Write> WarcWriter<W> {
    /// Create a new writer.
    pub fn new(w: W) -> Self {
        WarcWriter { writer: w }
    }

    /// Write a single record.
    ///
    /// The number of bytes written is returned upon success.
    pub fn write(&mut self, record: &Record) -> io::Result<usize> {
        self.write_raw(&record.to_raw())
    }

    /// Write a single raw record.
    ///
    /// The number of bytes written is returned upon success.
    pub fn write_raw(&mut self, record: &RawRecord) -> io::Result<usize> {
        const CRLF: &[u8] = b"\r\n";
        let mut bytes_written = 0;

        // `write_all` keeps writing until the whole buffer is accepted;
        // a plain `write` may accept only part of it.
        self.writer.write_all(b"WARC/")?;
        self.writer.write_all(record.version.as_bytes())?;
        self.writer.write_all(CRLF)?;
        bytes_written += 5 + record.version.len() + CRLF.len();

        for (token, value) in record.headers.iter() {
            let name = token.to_string();
            self.writer.write_all(name.as_bytes())?;
            self.writer.write_all(b": ")?;
            self.writer.write_all(value)?;
            self.writer.write_all(CRLF)?;
            bytes_written += name.len() + 2 + value.len() + CRLF.len();
        }
        // An empty line ends the header block.
        self.writer.write_all(CRLF)?;
        bytes_written += CRLF.len();

        // The body is followed by the two CRLF pairs that terminate a record.
        self.writer.write_all(&record.body)?;
        self.writer.write_all(b"\r\n\r\n")?;
        bytes_written += record.body.len() + 4;

        Ok(bytes_written)
    }
}

impl<W: Write> WarcWriter<BufWriter<W>> {
    /// Consume this writer and return the inner writer.
    ///
    /// # Flushing Compressed Data Streams
    ///
    /// This method must be called at the end of a GZIP-compressed stream. An extra call is
    /// needed to flush the buffered data and write the trailer to the output stream.
    ///
    /// ```ignore
    /// let gzip_stream = writer.into_inner()?;
    /// gzip_stream.finish().into_result()?;
    /// ```
    ///
    pub fn into_inner(self) -> Result<W, std::io::IntoInnerError<BufWriter<W>>> {
        self.writer.into_inner()
    }
}

impl WarcWriter<BufWriter<fs::File>> {
    /// Create a new writer which writes to a file.
    pub fn from_path<P: AsRef<Path>>(path: P) -> io::Result<Self> {
        let file = fs::OpenOptions::new()
            .read(true)
            .write(true)
            .create(true)
            .open(&path)?;
        let writer = BufWriter::with_capacity(1 * MB, file);

        Ok(WarcWriter::new(writer))
    }
}

#[cfg(feature = "gzip")]
impl WarcWriter<BufWriter<GzipWriter<std::fs::File>>> {
    /// Create a new writer which writes to a GZIP-compressed file.
    pub fn from_path_gzip<P: AsRef<Path>>(path: P) -> io::Result<Self> {
        let file = fs::OpenOptions::new()
            .read(true)
            .write(true)
            .create(true)
            .open(&path)?;
        let gzip_stream = GzipWriter::new(file)?;
        let writer = BufWriter::with_capacity(1 * MB, gzip_stream);

        Ok(WarcWriter::new(writer))
    }
}
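
As a self-contained illustration of the wire layout that `write_raw` produces, the sketch below serializes one record into any `Write` sink using only the standard library. The function name `write_record` and its parameters are hypothetical and simplified (headers are plain string pairs, and Content-Length is always emitted last); it is not the crate's API.

```rust
use std::io::{self, Write};

/// Write one record: version line, headers, blank line, body, two CRLF pairs.
/// Returns the total number of bytes written.
pub fn write_record<W: Write>(
    out: &mut W,
    version: &str,
    headers: &[(&str, &str)],
    body: &[u8],
) -> io::Result<usize> {
    let mut head = format!("WARC/{}\r\n", version);
    for (name, value) in headers {
        head.push_str(&format!("{}: {}\r\n", name, value));
    }
    // Content-Length is derived from the body so it cannot disagree with it.
    head.push_str(&format!("Content-Length: {}\r\n\r\n", body.len()));

    // `write_all` retries until every byte has been accepted by the sink.
    out.write_all(head.as_bytes())?;
    out.write_all(body)?;
    out.write_all(b"\r\n\r\n")?;
    Ok(head.len() + body.len() + 4)
}
```

Writing into a `Vec<u8>` is a convenient way to inspect the exact bytes before pointing the same function at a file or a compressed stream.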

record.rs


use chrono::prelude::*;
use std::borrow::Cow;
use std::collections::HashMap;
use std::fmt;
use uuid::Uuid;

use crate::header::WarcHeader;
use crate::record_type::RecordType;
use crate::truncated_type::TruncatedType;
use crate::Error as WarcError;

/// A single WARC record as parsed from a data stream.
///
/// It is guaranteed to be well-formed, but may not be valid according to the specification.
#[derive(Clone, Debug, PartialEq)]
pub struct RawRecord {
    /// The WARC standard version this record reports conformance to.
    pub version: String,
    /// All headers that are part of this record.
    pub headers: HashMap<WarcHeader, Vec<u8>>,
    /// The data body of this record.
    pub body: Vec<u8>,
}

/// A builder for WARC records from data.
#[derive(Clone, Default)]
pub struct RecordBuilder {
    value: Record,
    broken_headers: HashMap<WarcHeader, Vec<u8>>,
    last_error: Option<WarcError>,
}

/// A single WARC record.
///
/// It is guaranteed to be valid according to the specification it conforms to, except:
/// * The validity of the WARC-Record-ID header is not checked
/// * Date information not in the UTC timezone will be silently converted to UTC
///
/// This record can be constructed by a `RecordBuilder` or by a fallible conversion from a `RawRecord`.
#[derive(Clone, Debug, PartialEq)]
pub struct Record {
    // NB: invariant: does not contain the headers stored in the struct
    raw: RawRecord,
    record_date: DateTime<Utc>,
    record_id: String,
    record_type: RecordType,
    truncated_type: Option<TruncatedType>,
}

impl Record {
    /// Create a new empty record with default values.
    ///
    /// Using a `RecordBuilder` is more efficient when creating records from known data.
    ///
    /// A default record contains an empty body, and the following fields:
    /// * WARC-Record-ID: generated by `generate_record_id()`
    /// * WARC-Date: the current moment in time
    /// * WARC-Type: resource
    /// * Content-Length: 0
    pub fn new() -> Record {
        Record::default()
    }

    /// Transform this record into a raw record containing the same data.
    ///
    /// This is similar to `to_raw`, but is more efficient because it avoids copying.
    pub fn into_raw(self) -> RawRecord {
        let Record {
            mut raw,
            record_date,
            record_id,
            record_type,
            ..
        } = self;
        let insert1 = raw.headers.insert(
            WarcHeader::ContentLength,
            format!("{}", raw.body.len()).into(),
        );
        let insert2 = raw
            .headers
            .insert(WarcHeader::WarcType, record_type.to_string().into());
        let insert3 = raw.headers.insert(WarcHeader::RecordID, record_id.into());
        let insert4 = if let Some(ref truncated_type) = self.truncated_type {
            raw.headers
                .insert(WarcHeader::Truncated, truncated_type.to_string().into())
        } else {
            None
        };
        let insert5 = raw.headers.insert(
            WarcHeader::Date,
            record_date
                .to_rfc3339_opts(SecondsFormat::Secs, true)
                .into(),
        );

        debug_assert!(
            insert1.is_none()
                && insert2.is_none()
                && insert3.is_none()
                && insert4.is_none()
                && insert5.is_none(),
            "invariant violation: raw struct contains externally stored fields"
        );

        raw
    }

    /// Create a raw record which contains the same data as this record.
    pub fn to_raw(&self) -> RawRecord {
        let mut raw = self.raw.clone();
        let insert1 = raw.headers.insert(
            WarcHeader::ContentLength,
            format!("{}", self.raw.body.len()).into(),
        );
        let insert2 = raw
            .headers
            .insert(WarcHeader::WarcType, self.record_type.to_string().into());
        let insert3 = raw
            .headers
            .insert(WarcHeader::RecordID, self.record_id.clone().into());
        let insert4 = if let Some(ref truncated_type) = self.truncated_type {
            raw.headers
                .insert(WarcHeader::Truncated, truncated_type.to_string().into())
        } else {
            None
        };
        let insert5 = raw.headers.insert(
            WarcHeader::Date,
            self.record_date
                .to_rfc3339_opts(SecondsFormat::Secs, true)
                .into(),
        );

        debug_assert!(
            insert1.is_none()
                && insert2.is_none()
                && insert3.is_none()
                && insert4.is_none()
                && insert5.is_none(),
            "invariant violation: raw struct contains externally stored fields"
        );

        raw
    }

    /// Generate and return a new value suitable for use in the WARC-Record-ID header.
    ///
    /// # Compatibility
    /// The standard only places a small number of constraints on this field:
    /// 1. This value is globally unique "for its period of use"
    /// 1. This value is a valid URI
    /// 1. This value "clearly indicate\[s\] a documented and registered scheme to which it conforms."
    ///
    /// These guarantees will be upheld by all generated outputs, where the "period of use" is
    /// presumed to be indefinite and unlimited.
    ///
    /// However, any *specific algorithm* used to generate values is **not** part of the crate's
    /// public API for purposes of semantic versioning.
    ///
    /// # Implementation
    /// The current implementation generates random values based on UUID version 4.
    ///
    pub fn generate_record_id() -> String {
        format!("<{}>", Uuid::new_v4().to_urn().to_string())
    }

    fn parse_content_length(len: &str) -> Result<u64, WarcError> {
        (len).parse::<u64>().map_err(|_| {
            WarcError::MalformedHeader(
                WarcHeader::ContentLength,
                "not an integer between 0 and 2^64-1".to_string(),
            )
        })
    }

    fn parse_record_date(date: &str) -> Result<DateTime<Utc>, WarcError> {
        DateTime::parse_from_rfc3339(date)
            .map_err(|_| {
                WarcError::MalformedHeader(
                    WarcHeader::Date,
                    "not an ISO 8601 datestamp".to_string(),
                )
            })
            .map(|date| date.into())
    }

    /// Return the Content-Length header for this record.
    ///
    /// This value is guaranteed to match the actual length of the body.
    pub fn content_length(&self) -> u64 {
        self.raw.body.len() as u64
    }

    /// Return the WARC version string of this record.
    pub fn warc_version(&self) -> &str {
        &self.raw.version
    }

    /// Set the WARC version string of this record.
    pub fn set_warc_version<S: Into<String>>(&mut self, id: S) {
        self.raw.version = id.into();
    }

    /// Return the WARC-Record-ID header for this record.
    pub fn warc_id(&self) -> &str {
        &self.record_id
    }

    /// Set the WARC-Record-ID header for this record.
    ///
    /// Note that this value is **not** checked for validity.
    pub fn set_warc_id<S: Into<String>>(&mut self, id: S) {
        self.record_id = id.into();
    }

    /// Return the WARC-Type header for this record.
    pub fn warc_type(&self) -> &RecordType {
        &self.record_type
    }

    /// Set the WARC-Type header for this record.
    pub fn set_warc_type(&mut self, type_: RecordType) {
        self.record_type = type_;
    }

    /// Return the WARC-Date header for this record.
    pub fn date(&self) -> &DateTime<Utc> {
        &self.record_date
    }

    /// Set the WARC-Date header for this record.
    pub fn set_date(&mut self, date: DateTime<Utc>) {
        self.record_date = date;
    }

    /// Return the WARC-Truncated header for this record.
    pub fn truncated_type(&self) -> &Option<TruncatedType> {
        &self.truncated_type
    }

    /// Set the WARC-Truncated header for this record.
    pub fn set_truncated_type(&mut self, truncated_type: TruncatedType) {
        self.truncated_type = Some(truncated_type);
    }

    /// Remove the WARC-Truncated header for this record.
    pub fn clear_truncated_type(&mut self) {
        self.truncated_type = None;
    }

    /// Return the WARC header requested if present in this record, or `None`.
    pub fn header(&self, header: WarcHeader) -> Option<Cow<'_, str>> {
        match &header {
            WarcHeader::ContentLength => Some(Cow::Owned(format!("{}", self.content_length()))),
            WarcHeader::RecordID => Some(Cow::Borrowed(self.warc_id())),
            WarcHeader::WarcType => Some(Cow::Owned(self.record_type.to_string())),
            WarcHeader::Date => Some(Cow::Owned(
                self.date().to_rfc3339_opts(SecondsFormat::Secs, true),
            )),
            // Lossy conversion avoids a panic on header values that are not valid UTF-8.
            _ => self
                .raw
                .headers
                .get(&header)
                .map(|h| String::from_utf8_lossy(h)),
        }
    }

    /// Set a WARC header in this record, returning the previous value if present.
    ///
    /// # Errors
    ///
    /// If setting a header whose value has a well-formedness test, an error is returned if the
    /// value is not well-formed.
    pub fn set_header<V>(
        &mut self,
        header: WarcHeader,
        value: V,
    ) -> Result<Option<Cow<'_, str>>, WarcError>
    where
        V: Into<String>,
    {
        let value = value.into();
        match &header {
            WarcHeader::Date => {
                let old_date =
                    std::mem::replace(&mut self.record_date, Record::parse_record_date(&value)?);
                Ok(Some(Cow::Owned(
                    old_date.to_rfc3339_opts(SecondsFormat::Secs, true),
                )))
            }
            WarcHeader::RecordID => {
                let old_id = std::mem::replace(&mut self.record_id, value);
                Ok(Some(Cow::Owned(old_id)))
            }
            WarcHeader::WarcType => {
                let old_type = std::mem::replace(&mut self.record_type, RecordType::from(&value));
                Ok(Some(Cow::Owned(old_type.to_string())))
            }
            WarcHeader::Truncated => {
                let old_type = self.truncated_type.take();
                self.truncated_type = Some(TruncatedType::from(&value));
                Ok(old_type.map(|old| (Cow::Owned(old.to_string()))))
            }
            WarcHeader::ContentLength => {
                if Record::parse_content_length(&value)? != self.content_length() {
                    Err(WarcError::MalformedHeader(
                        WarcHeader::ContentLength,
                        "content length != body size".to_string(),
                    ))
                } else {
                    Ok(Some(Cow::Owned(value)))
                }
            }
            _ => Ok(self
                .raw
                .headers
                .insert(header, Vec::from(value))
                .map(|v| Cow::Owned(String::from_utf8(v).unwrap()))),
        }
    }

    /// Return the body of this record.
    pub fn body(&self) -> &[u8] {
        self.raw.body.as_slice()
    }

    /// Return a reference to mutate the body of this record, but without changing its length.
    ///
    /// To update the body of the record or change its length, use the `replace_body` method
    /// instead.
    pub fn body_mut(&mut self) -> &mut [u8] {
        self.raw.body.as_mut_slice()
    }

    /// Replace the body of this record with the given body.
    pub fn replace_body<V: Into<Vec<u8>>>(&mut self, new_body: V) {
        let _: Vec<u8> = std::mem::replace(&mut self.raw.body, new_body.into());
    }
}

impl Default for Record {
    fn default() -> Record {
        Record {
            raw: RawRecord {
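                // NOTE: `WarcWriter::write_raw` and `Display` both prepend "WARC/" to this
                // value, while the reader tests expect a bare "1.0".
                version: "WARC/1.0".to_string(),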
                headers: HashMap::new(),
                body: vec![],
            },
            record_date: Utc::now(),
            record_id: Record::generate_record_id(),
            record_type: RecordType::Resource,
            truncated_type: None,
        }
    }
}

impl std::convert::TryFrom<RawRecord> for Record {
    type Error = WarcError;
    fn try_from(mut raw: RawRecord) -> Result<Self, WarcError> {
        raw.headers
            .remove(&WarcHeader::ContentLength)
            .ok_or_else(|| WarcError::MissingHeader(WarcHeader::ContentLength))
            .and_then(|vec| {
                String::from_utf8(vec).map_err(|_| {
                    WarcError::MalformedHeader(
                        WarcHeader::ContentLength,
                        "not a UTF-8 string".to_string(),
                    )
                })
            })
            .and_then(|len| Record::parse_content_length(&len))
            .and_then(|len| {
                if len == raw.body.len() as u64 {
                    Ok(())
                } else {
                    Err(WarcError::MalformedHeader(
                        WarcHeader::ContentLength,
                        "content length != body length".to_string(),
                    ))
                }
            })?;

        let record_type = raw
            .headers
            .remove(&WarcHeader::WarcType)
            .ok_or_else(|| WarcError::MissingHeader(WarcHeader::WarcType))
            .and_then(|vec| {
                String::from_utf8(vec).map_err(|_| {
                    WarcError::MalformedHeader(
                        WarcHeader::WarcType,
                        "not a UTF-8 string".to_string(),
                    )
                })
            })
            .map(|rtype| rtype.into())?;

        let record_id = raw
            .headers
            .remove(&WarcHeader::RecordID)
            .ok_or_else(|| WarcError::MissingHeader(WarcHeader::RecordID))
            .and_then(|vec| {
                String::from_utf8(vec).map_err(|_| {
                    WarcError::MalformedHeader(
                        WarcHeader::RecordID,
                        "not a UTF-8 string".to_string(),
                    )
                })
            })?;

        let record_date = raw
            .headers
            .remove(&WarcHeader::Date)
            .ok_or_else(|| WarcError::MissingHeader(WarcHeader::Date))
            .and_then(|vec| {
                String::from_utf8(vec).map_err(|_| {
                    WarcError::MalformedHeader(WarcHeader::Date, "not a UTF-8 string".to_string())
                })
            })
            .and_then(|date| Record::parse_record_date(&date))?;

        Ok(Record {
            raw,
            record_date,
            record_id,
            record_type,
            ..Default::default()
        })
    }
}

impl fmt::Display for Record {
    fn fmt(&self, f: &mut fmt::Formatter) -> fmt::Result {
        self.clone().into_raw().fmt(f)
    }
}

impl std::convert::From<Record> for RawRecord {
    fn from(record: Record) -> RawRecord {
        record.into_raw()
    }
}

impl fmt::Display for RawRecord {
    fn fmt(&self, f: &mut fmt::Formatter) -> fmt::Result {
        writeln!(f, "WARC/{}", self.version)?;

        for (token, value) in self.headers.iter() {
            writeln!(
                f,
                "{}: {}",
                token.to_string(),
                String::from_utf8_lossy(value)
            )?;
        }
        writeln!(f)?;

        if !self.body.is_empty() {
            writeln!(f, "\n{}", String::from_utf8_lossy(&self.body))?;
        }

        writeln!(f)?;

        Ok(())
    }
}

impl RecordBuilder {
    pub fn body(&mut self, body: Vec<u8>) -> &mut Self {
        self.value.replace_body(body);

        self
    }

    pub fn date(&mut self, date: DateTime<Utc>) -> &mut Self {
        self.value.set_date(date);

        self
    }

    pub fn warc_id<S: Into<String>>(&mut self, id: S) -> &mut Self {
        self.value.set_warc_id(id);

        self
    }

    pub fn version(&mut self, version: String) -> &mut Self {
        self.value.set_warc_version(version);

        self
    }

    pub fn warc_type(&mut self, warc_type: RecordType) -> &mut Self {
        self.value.set_warc_type(warc_type);

        self
    }

    pub fn truncated_type(&mut self, trunc_type: TruncatedType) -> &mut Self {
        self.value.set_truncated_type(trunc_type);

        self
    }

    pub fn header<V: Into<Vec<u8>>>(&mut self, key: WarcHeader, value: V) -> &mut Self {
        self.broken_headers.insert(key.clone(), value.into());

        let is_ok;
        match std::str::from_utf8(self.broken_headers.get(&key).unwrap()) {
            Ok(string) => {
                if let Err(e) = self.value.set_header(key.clone(), string) {
                    self.last_error = Some(e);
                    is_ok = false;
                } else {
                    is_ok = true;
                }
            }
            Err(_) => {
                is_ok = false;
                self.last_error = Some(WarcError::MalformedHeader(
                    key.clone(),
                    "not a UTF-8 string".to_string(),
                ));
            }
        }

        if is_ok {
            self.broken_headers.remove(&key);
        }

        self
    }

    pub fn build_raw(self) -> RawRecord {
        let RecordBuilder {
            value,
            broken_headers,
            ..
        } = self;
        let mut raw = value.into_raw();
        raw.headers.extend(broken_headers);

        raw
    }

    pub fn build(self) -> Result<Record, WarcError> {
        let RecordBuilder {
            value,
            broken_headers,
            last_error,
        } = self;

        if let Some(e) = last_error {
            Err(e)
        } else {
            debug_assert!(
                broken_headers.is_empty(),
                "invariant violation: broken headers without last error"
            );
            Ok(value)
        }
    }
}

#[cfg(test)]
mod record_tests {
    use crate::header::WarcHeader;
    use crate::{Record, RecordType};

    use chrono::prelude::*;

    #[test]
    fn default() {
        let before = Utc::now();
        std::thread::sleep(std::time::Duration::from_millis(10));
        let record = Record::default();
        std::thread::sleep(std::time::Duration::from_millis(10));
        let after = Utc::now();
        assert_eq!(record.content_length(), 0);
        assert_eq!(record.warc_version(), "WARC/1.0");
        assert_eq!(record.warc_type(), &RecordType::Resource);
        assert!(record.date() > &before);
        assert!(record.date() < &after);
    }

    #[test]
    fn impl_eq() {
        let record1 = Record::default();
        let record2 = record1.clone();
        assert_eq!(record1, record2);
    }

    #[test]
    fn body() {
        let mut record = Record::default();
        assert_eq!(record.content_length(), 0);
        assert_eq!(record.body(), &[]);
        record.replace_body(b"hello!!".to_vec());
        assert_eq!(record.content_length(), 7);
        assert_eq!(record.body(), b"hello!!");
        record.body_mut().copy_from_slice(b"goodbye");
        assert_eq!(record.content_length(), 7);
        assert_eq!(record.body(), b"goodbye");
    }

    #[test]
    fn add_header() {
        let mut record = Record::default();
        assert!(record.header(WarcHeader::TargetURI).is_none());
        assert!(record
            .set_header(WarcHeader::TargetURI, "https://www.rust-lang.org")
            .unwrap()
            .is_none());
        assert_eq!(
            record.header(WarcHeader::TargetURI).unwrap(),
            "https://www.rust-lang.org"
        );
        assert_eq!(
            record
                .set_header(WarcHeader::TargetURI, "https://docs.rs")
                .unwrap()
                .unwrap(),
            "https://www.rust-lang.org"
        );
        assert_eq!(
            record.header(WarcHeader::TargetURI).unwrap(),
            "https://docs.rs"
        );
    }

    #[test]
    fn set_header_override_content_length() {
        let mut record = Record::default();
        assert_eq!(record.header(WarcHeader::ContentLength).unwrap(), "0");
        assert!(record
            .set_header(WarcHeader::ContentLength, "really short")
            .is_err());
        assert!(record.set_header(WarcHeader::ContentLength, "50").is_err());
        assert_eq!(
            record
                .set_header(WarcHeader::ContentLength, "0")
                .unwrap()
                .unwrap(),
            "0"
        );
    }

    #[test]
    fn set_header_override_warc_date() {
        let mut record = Record::default();
        let old_date = record.date().to_rfc3339_opts(SecondsFormat::Secs, true);
        assert_eq!(record.header(WarcHeader::Date).unwrap(), old_date);
        assert!(record.set_header(WarcHeader::Date, "yesterday").is_err());
        assert_eq!(
            record
                .set_header(WarcHeader::Date, "2020-07-21T22:00:00Z")
                .unwrap()
                .unwrap(),
            old_date
        );
        assert_eq!(
            record.header(WarcHeader::Date).unwrap(),
            "2020-07-21T22:00:00Z"
        );
    }

    #[test]
    fn set_header_override_warc_record_id() {
        let mut record = Record::default();
        let old_id = record.warc_id().to_string();
        assert_eq!(
            record.header(WarcHeader::RecordID).unwrap(),
            old_id.as_str()
        );
        assert_eq!(
            record
                .set_header(WarcHeader::RecordID, "urn:http:www.rust-lang.org")
                .unwrap()
                .unwrap(),
            old_id.as_str()
        );
        assert_eq!(
            record.header(WarcHeader::RecordID).unwrap(),
            "urn:http:www.rust-lang.org"
        );
    }

    #[test]
    fn set_header_override_warc_type() {
        let mut record = Record::default();
        assert_eq!(record.header(WarcHeader::WarcType).unwrap(), "resource");
        assert_eq!(
            record
                .set_header(WarcHeader::WarcType, "revisit")
                .unwrap()
                .unwrap(),
            "resource"
        );
        assert_eq!(record.header(WarcHeader::WarcType).unwrap(), "revisit");
    }
}

#[cfg(test)]
mod raw_tests {
    use crate::header::WarcHeader;
    use crate::{RawRecord, Record, RecordType};

    use std::collections::HashMap;
    use std::convert::TryFrom;

    #[test]
    fn create() {
        let record = RawRecord {
            version: "WARC/1.0".to_owned(),
            headers: HashMap::new(),
            body: vec![],
        };

        assert_eq!(record.body.len(), 0);
    }

    #[test]
    fn create_with_headers() {
        let record = RawRecord {
            version: "WARC/1.0".to_owned(),
            headers: vec![(
                WarcHeader::WarcType,
                RecordType::WarcInfo.to_string().into_bytes(),
            )]
            .into_iter()
            .collect(),
            body: vec![],
        };

        assert_eq!(record.headers.len(), 1);
    }

    #[test]
    fn verify_ok() {
        let record = RawRecord {
            version: "WARC/1.0".to_owned(),
            headers: vec![
                (WarcHeader::WarcType, b"dunno".to_vec()),
                (WarcHeader::ContentLength, b"5".to_vec()),
                (
                    WarcHeader::RecordID,
                    b"<urn:test:basic-record:record-0>".to_vec(),
                ),
                (WarcHeader::Date, b"2020-07-08T02:52:55Z".to_vec()),
            ]
            .into_iter()
            .collect(),
            body: b"12345".to_vec(),
        };

        assert!(Record::try_from(record).is_ok());
    }

    #[test]
    fn verify_missing_type() {
        let record = RawRecord {
            version: "WARC/1.0".to_owned(),
            headers: vec![
                (WarcHeader::ContentLength, b"5".to_vec()),
                (
                    WarcHeader::RecordID,
                    b"<urn:test:basic-record:record-0>".to_vec(),
                ),
                (WarcHeader::Date, b"2020-07-08T02:52:55Z".to_vec()),
            ]
            .into_iter()
            .collect(),
            body: b"12345".to_vec(),
        };

        assert!(Record::try_from(record).is_err());
    }

    #[test]
    fn verify_missing_content_length() {
        let record = RawRecord {
            version: "WARC/1.0".to_owned(),
            headers: vec![
                (WarcHeader::WarcType, b"dunno".to_vec()),
                (
                    WarcHeader::RecordID,
                    b"<urn:test:basic-record:record-0>".to_vec(),
                ),
                (WarcHeader::Date, b"2020-07-08T02:52:55Z".to_vec()),
            ]
            .into_iter()
            .collect(),
            body: b"12345".to_vec(),
        };

        assert!(Record::try_from(record).is_err());
    }

    #[test]
    fn verify_missing_record_id() {
        let record = RawRecord {
            version: "WARC/1.0".to_owned(),
            headers: vec![
                (WarcHeader::WarcType, b"dunno".to_vec()),
                (WarcHeader::ContentLength, b"5".to_vec()),
                (WarcHeader::Date, b"2020-07-08T02:52:55Z".to_vec()),
            ]
            .into_iter()
            .collect(),
            body: b"12345".to_vec(),
        };

        assert!(Record::try_from(record).is_err());
    }

    #[test]
    fn verify_missing_date() {
        let record = RawRecord {
            version: "WARC/1.0".to_owned(),
            headers: vec![
                (WarcHeader::WarcType, b"dunno".to_vec()),
                (WarcHeader::ContentLength, b"5".to_vec()),
                (
                    WarcHeader::RecordID,
                    b"<urn:test:basic-record:record-0>".to_vec(),
                ),
            ]
            .into_iter()
            .collect(),
            body: b"12345".to_vec(),
        };

        assert!(Record::try_from(record).is_err());
    }
}

#[cfg(test)]
mod builder_tests {
    use crate::header::WarcHeader;
    use crate::{RawRecord, Record, RecordBuilder, RecordType, TruncatedType};

    use std::convert::TryFrom;

    #[test]
    fn default() {
        let raw = RecordBuilder::default().build_raw();
        assert_eq!(raw.version, "WARC/1.0".to_string());
        assert_eq!(
            raw.headers.get(&WarcHeader::ContentLength).unwrap(),
            &b"0".to_vec()
        );
        assert!(raw.body.is_empty());
        assert!(RecordBuilder::default().build().is_ok());
    }

    #[test]
    fn impl_eq_raw() {
        let builder = RecordBuilder::default();
        let raw1 = builder.clone().build_raw();

        let raw2 = builder.build_raw();
        assert_eq!(raw1, raw2);
    }

    #[test]
    fn impl_eq_record() {
        let builder = RecordBuilder::default();
        let record1 = builder.clone().build().unwrap();

        let record2 = builder.build().unwrap();
        assert_eq!(record1, record2);
    }

    #[test]
    fn create_with_headers() {
        let record = RawRecord {
            version: "WARC/1.0".to_owned(),
            headers: vec![(
                WarcHeader::WarcType,
                RecordType::WarcInfo.to_string().into_bytes(),
            )]
            .into_iter()
            .collect(),
            body: vec![],
        };

        assert_eq!(record.headers.len(), 1);
    }

    #[test]
    fn verify_ok() {
        let record = RawRecord {
            version: "WARC/1.0".to_owned(),
            headers: vec![
                (WarcHeader::WarcType, b"dunno".to_vec()),
                (WarcHeader::ContentLength, b"5".to_vec()),
                (
                    WarcHeader::RecordID,
                    b"<urn:test:basic-record:record-0>".to_vec(),
                ),
                (WarcHeader::Date, b"2020-07-08T02:52:55Z".to_vec()),
            ]
            .into_iter()
            .collect(),
            body: b"12345".to_vec(),
        };

        assert!(Record::try_from(record).is_ok());
    }

    #[test]
    fn verify_content_length() {
        let mut builder = RecordBuilder::default();
        builder.body(b"12345".to_vec());

        assert_eq!(
            builder
                .clone()
                .build()
                .unwrap()
                .into_raw()
                .headers
                .get(&WarcHeader::ContentLength)
                .unwrap(),
            &b"5".to_vec()
        );

        assert_eq!(
            builder
                .clone()
                .build_raw()
                .headers
                .get(&WarcHeader::ContentLength)
                .unwrap(),
            &b"5".to_vec()
        );

        builder.header(WarcHeader::ContentLength, "1");
        assert_eq!(
            builder
                .clone()
                .build_raw()
                .headers
                .get(&WarcHeader::ContentLength)
                .unwrap(),
            &b"1".to_vec()
        );

        assert!(builder.build().is_err());
    }

    #[test]
    fn verify_build_record_type() {
        let mut builder1 = RecordBuilder::default();
        let mut builder2 = builder1.clone();

        builder1.header(WarcHeader::WarcType, "request");
        builder2.warc_type(RecordType::Request);

        let record1 = builder1.build().unwrap();
        let record2 = builder2.build().unwrap();

        assert_eq!(record1, record2);
        assert_eq!(
            record1.into_raw().headers.get(&WarcHeader::WarcType),
            Some(&b"request".to_vec())
        );
    }

    #[test]
    fn verify_build_date() {
        const DATE_STRING_0: &str = "2020-07-08T02:52:55Z";
        const DATE_STRING_1: &[u8] = b"2020-07-18T02:12:45Z";

        let mut builder = RecordBuilder::default();
        builder.date(Record::parse_record_date(DATE_STRING_0).unwrap());

        let record = builder.clone().build().unwrap();
        assert_eq!(
            record.to_raw().headers.get(&WarcHeader::Date).unwrap(),
            &DATE_STRING_0.as_bytes()
        );
        assert_eq!(
            record.into_raw().headers.get(&WarcHeader::Date).unwrap(),
            &DATE_STRING_0.as_bytes()
        );
        assert_eq!(
            builder
                .clone()
                .build_raw()
                .headers
                .get(&WarcHeader::Date)
                .unwrap(),
            &DATE_STRING_0.as_bytes()
        );

        builder.header(WarcHeader::Date, DATE_STRING_1.to_vec());
        let record = builder.clone().build().unwrap();
        assert_eq!(
            record.to_raw().headers.get(&WarcHeader::Date).unwrap(),
            &DATE_STRING_1.to_vec()
        );
        assert_eq!(
            record.into_raw().headers.get(&WarcHeader::Date).unwrap(),
            &DATE_STRING_1.to_vec()
        );
        assert_eq!(
            builder
                .clone()
                .build_raw()
                .headers
                .get(&WarcHeader::Date)
                .unwrap(),
            &DATE_STRING_1.to_vec()
        );

        builder.header(WarcHeader::Date, b"not-a-dayTor:a:time".to_vec());
        assert!(builder.build().is_err());
    }

    #[test]
    fn verify_build_record_id() {
        const RECORD_ID_0: &[u8] = b"<urn:test:verify-build-id:record-0>";
        const RECORD_ID_1: &[u8] = b"<urn:test:verify-build-id:record-1>";

        let mut builder = RecordBuilder::default();
        builder.warc_id(std::str::from_utf8(RECORD_ID_0).unwrap());

        let record = builder.clone().build().unwrap();
        assert_eq!(
            record.to_raw().headers.get(&WarcHeader::RecordID).unwrap(),
            &RECORD_ID_0.to_vec()
        );
        assert_eq!(
            record
                .into_raw()
                .headers
                .get(&WarcHeader::RecordID)
                .unwrap(),
            &RECORD_ID_0.to_vec()
        );
        assert_eq!(
            builder
                .clone()
                .build_raw()
                .headers
                .get(&WarcHeader::RecordID)
                .unwrap(),
            &RECORD_ID_0.to_vec()
        );

        builder.header(WarcHeader::RecordID, RECORD_ID_1.to_vec());
        let record = builder.clone().build().unwrap();
        assert_eq!(
            record.to_raw().headers.get(&WarcHeader::RecordID).unwrap(),
            &RECORD_ID_1.to_vec()
        );
        assert_eq!(
            record
                .into_raw()
                .headers
                .get(&WarcHeader::RecordID)
                .unwrap(),
            &RECORD_ID_1.to_vec()
        );
        assert_eq!(
            builder
                .clone()
                .build_raw()
                .headers
                .get(&WarcHeader::RecordID)
                .unwrap(),
            &RECORD_ID_1.to_vec()
        );
    }

    #[test]
    fn verify_build_truncated_type() {
        const TRUNCATED_TYPE_0: &[u8] = b"length";
        const TRUNCATED_TYPE_1: &[u8] = b"disconnect";

        let mut builder = RecordBuilder::default();
        builder.truncated_type(TruncatedType::Length);

        let record = builder.clone().build().unwrap();
        assert_eq!(
            record.to_raw().headers.get(&WarcHeader::Truncated).unwrap(),
            &TRUNCATED_TYPE_0.to_vec()
        );
        assert_eq!(
            record
                .into_raw()
                .headers
                .get(&WarcHeader::Truncated)
                .unwrap(),
            &TRUNCATED_TYPE_0.to_vec()
        );
        assert_eq!(
            builder
                .clone()
                .build_raw()
                .headers
                .get(&WarcHeader::Truncated)
                .unwrap(),
            &TRUNCATED_TYPE_0.to_vec()
        );

        builder.header(WarcHeader::Truncated, "disconnect");
        let record = builder.clone().build().unwrap();
        assert_eq!(
            record.to_raw().headers.get(&WarcHeader::Truncated).unwrap(),
            &TRUNCATED_TYPE_1.to_vec()
        );
        assert_eq!(
            record
                .into_raw()
                .headers
                .get(&WarcHeader::Truncated)
                .unwrap(),
            &TRUNCATED_TYPE_1.to_vec()
        );
        assert_eq!(
            builder
                .clone()
                .build_raw()
                .headers
                .get(&WarcHeader::Truncated)
                .unwrap(),
            &TRUNCATED_TYPE_1.to_vec()
        );

        builder.header(WarcHeader::Truncated, "foreign-intervention");
        assert_eq!(
            builder
                .clone()
                .build()
                .unwrap()
                .into_raw()
                .headers
                .get(&WarcHeader::Truncated)
                .unwrap()
                .as_slice(),
            &b"foreign-intervention"[..]
        );

        assert_eq!(
            builder
                .clone()
                .build_raw()
                .headers
                .get(&WarcHeader::Truncated)
                .unwrap()
                .as_slice(),
            &b"foreign-intervention"[..]
        );
    }
}

record_type.rs


#[derive(Clone, Debug, PartialEq)]
pub enum RecordType {
    WarcInfo,
    Response,
    Resource,
    Request,
    Metadata,
    Revisit,
    Conversion,
    Continuation,
    Unknown(String),
}

impl ToString for RecordType {
    fn to_string(&self) -> String {
        let stringified = match *self {
            RecordType::WarcInfo => "warcinfo",
            RecordType::Response => "response",
            RecordType::Resource => "resource",
            RecordType::Request => "request",
            RecordType::Metadata => "metadata",
            RecordType::Revisit => "revisit",
            RecordType::Conversion => "conversion",
            RecordType::Continuation => "continuation",
            RecordType::Unknown(ref val) => val.as_ref(),
        };
        stringified.to_string()
    }
}

impl<S: AsRef<str>> From<S> for RecordType {
    fn from(string: S) -> Self {
        let lower: String = string.as_ref().to_lowercase();
        match lower.as_str() {
            "warcinfo" => RecordType::WarcInfo,
            "response" => RecordType::Response,
            "resource" => RecordType::Resource,
            "request" => RecordType::Request,
            "metadata" => RecordType::Metadata,
            "revisit" => RecordType::Revisit,
            "conversion" => RecordType::Conversion,
            "continuation" => RecordType::Continuation,
            _ => RecordType::Unknown(lower),
        }
    }
}
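
As a brief illustration of the conversions in record_type.rs, here is a minimal sketch that uses a trimmed-down copy of the enum (only two known variants are kept so that it compiles on its own). It is meant to show that parsing is case-insensitive and that unrecognized values are kept in lowercase; it is not a replacement for the listing above.

```rust
// Trimmed-down copy of the enum from the listing, for illustration only.
#[derive(Clone, Debug, PartialEq)]
pub enum RecordType {
    Response,
    Request,
    Unknown(String),
}

impl ToString for RecordType {
    fn to_string(&self) -> String {
        match self {
            RecordType::Response => "response".to_string(),
            RecordType::Request => "request".to_string(),
            // Unknown types are written back exactly as stored.
            RecordType::Unknown(val) => val.clone(),
        }
    }
}

impl<S: AsRef<str>> From<S> for RecordType {
    fn from(string: S) -> Self {
        // Matching is done on the lowercased input, so "RESPONSE" and
        // "response" map to the same variant.
        let lower = string.as_ref().to_lowercase();
        match lower.as_str() {
            "response" => RecordType::Response,
            "request" => RecordType::Request,
            _ => RecordType::Unknown(lower),
        }
    }
}
```

For example, `RecordType::from("RESPONSE")` yields `RecordType::Response`, while `RecordType::from("Dunno")` yields `RecordType::Unknown` holding the lowercased string "dunno".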

header.rs


/// Represents a WARC header defined by the standard.
///
/// All variants are upper-camel-case versions of the standard names, with the hyphens removed
/// and the leading `WARC-` usually dropped.
#[derive(Clone, Debug, Hash, Eq, PartialEq)]
pub enum WarcHeader {
    ContentLength,
    ContentType,
    BlockDigest,
    ConcurrentTo,
    Date,
    Filename,
    IdentifiedPayloadType,
    IPAddress,
    PayloadDigest,
    Profile,
    RecordID,
    RefersTo,
    SegmentNumber,
    SegmentOriginID,
    SegmentTotalLength,
    TargetURI,
    Truncated,
    WarcType,
    WarcInfoID,
    Unknown(String),
}

impl ToString for WarcHeader {
    fn to_string(&self) -> String {
        let stringified = match self {
            WarcHeader::ContentLength => "content-length",
            WarcHeader::ContentType => "content-type",
            WarcHeader::BlockDigest => "warc-block-digest",
            WarcHeader::ConcurrentTo => "warc-concurrent-to",
            WarcHeader::Date => "warc-date",
            WarcHeader::Filename => "warc-filename",
            WarcHeader::IdentifiedPayloadType => "warc-identified-payload-type",
            WarcHeader::IPAddress => "warc-ip-address",
            WarcHeader::PayloadDigest => "warc-payload-digest",
            WarcHeader::Profile => "warc-profile",
            WarcHeader::RecordID => "warc-record-id",
            WarcHeader::RefersTo => "warc-refers-to",
            WarcHeader::SegmentNumber => "warc-segment-number",
            WarcHeader::SegmentOriginID => "warc-segment-origin-id",
            WarcHeader::SegmentTotalLength => "warc-segment-total-length",
            WarcHeader::TargetURI => "warc-target-uri",
            WarcHeader::Truncated => "warc-truncated",
            WarcHeader::WarcType => "warc-type",
            WarcHeader::WarcInfoID => "warc-warcinfo-id",
            WarcHeader::Unknown(ref string) => string,
        };
        stringified.to_string()
    }
}

impl<S: AsRef<str>> From<S> for WarcHeader {
    fn from(string: S) -> Self {
        let lower: String = string.as_ref().to_lowercase();
        match lower.as_str() {
            "content-length" => WarcHeader::ContentLength,
            "content-type" => WarcHeader::ContentType,
            "warc-block-digest" => WarcHeader::BlockDigest,
            "warc-concurrent-to" => WarcHeader::ConcurrentTo,
            "warc-date" => WarcHeader::Date,
            "warc-filename" => WarcHeader::Filename,
            "warc-identified-payload-type" => WarcHeader::IdentifiedPayloadType,
            "warc-ip-address" => WarcHeader::IPAddress,
            "warc-payload-digest" => WarcHeader::PayloadDigest,
            "warc-profile" => WarcHeader::Profile,
            "warc-record-id" => WarcHeader::RecordID,
            "warc-refers-to" => WarcHeader::RefersTo,
            "warc-segment-number" => WarcHeader::SegmentNumber,
            "warc-segment-origin-id" => WarcHeader::SegmentOriginID,
            "warc-segment-total-length" => WarcHeader::SegmentTotalLength,
            "warc-target-uri" => WarcHeader::TargetURI,
            "warc-truncated" => WarcHeader::Truncated,
            "warc-type" => WarcHeader::WarcType,
            "warc-warcinfo-id" => WarcHeader::WarcInfoID,
            _ => WarcHeader::Unknown(lower),
        }
    }
}

parser.rs


use nom::{
    bytes::streaming::{tag, take, take_while1},
    character::streaming::{line_ending, not_line_ending, space0},
    error::ErrorKind,
    multi::many1,
    sequence::tuple,
    IResult,
};
use std::str;

// TODO: evaluate the use of `ErrorKind::Verify` here.
fn version(input: &[u8]) -> IResult<&[u8], &str> {
    let (input, (_, version, _)) = tuple((tag("WARC/"), not_line_ending, line_ending))(input)?;

    let version_str = match str::from_utf8(version) {
        Err(_) => {
            return Err(nom::Err::Error((input, ErrorKind::Verify)));
        }
        Ok(version) => version,
    };

    Ok((input, version_str))
}

fn is_header_token_char(chr: u8) -> bool {
    match chr {
        0..=31
        | 128..=255
        | b'('
        | b')'
        | b'<'
        | b'>'
        | b'@'
        | b','
        | b';'
        | b':'
        | b'"'
        | b'/'
        | b'['
        | b']'
        | b'?'
        | b'='
        | b'{'
        | b'}'
        | b' '
        | b'\\' => false,
        _ => true,
    }
}

fn header(input: &[u8]) -> IResult<&[u8], (&[u8], &[u8])> {
    let (input, (token, _, _, _, value, _)) = tuple((
        take_while1(is_header_token_char),
        space0,
        tag(":"),
        space0,
        not_line_ending,
        line_ending,
    ))(input)?;

    Ok((input, (token, value)))
}

// TODO: evaluate the use of `ErrorKind::Verify` here.
pub fn headers(input: &[u8]) -> IResult<&[u8], (&str, Vec<(&str, &[u8])>, usize)> {
    let (input, version) = version(input)?;
    let (input, headers) = many1(header)(input)?;

    let mut content_length: Option<usize> = None;
    let mut warc_headers: Vec<(&str, &[u8])> = Vec::with_capacity(headers.len());

    for header in headers {
        let token_str = match str::from_utf8(header.0) {
            Err(_) => {
                return Err(nom::Err::Error((input, ErrorKind::Verify)));
            }
            Ok(token) => token,
        };

        if content_length.is_none() && token_str.to_lowercase() == "content-length" {
            let value_str = match str::from_utf8(header.1) {
                Err(_) => {
                    return Err(nom::Err::Error((input, ErrorKind::Verify)));
                }
                Ok(value) => value,
            };

            match value_str.parse::<usize>() {
                Err(_) => {
                    return Err(nom::Err::Error((input, ErrorKind::Verify)));
                }
                Ok(len) => {
                    content_length = Some(len);
                }
            }
        }

        warc_headers.push((token_str, header.1));
    }

    // TODO: Technically if we didn't find a `content-length` header, the record is invalid. Should
    // we be returning an error here instead?
    let content_length = content_length.unwrap_or(0);

    Ok((input, (version, warc_headers, content_length)))
}

pub fn record(input: &[u8]) -> IResult<&[u8], (&str, Vec<(&str, &[u8])>, &[u8])> {
    let (input, (headers, _)) = tuple((headers, line_ending))(input)?;
    let (input, (body, _, _)) = tuple((take(headers.2), line_ending, line_ending))(input)?;

    Ok((input, (headers.0, headers.1, body)))
}

#[cfg(test)]
mod tests {
    use super::{header, headers, record, version};
    use nom::error::ErrorKind;
    use nom::Err;
    use nom::Needed;

    #[test]
    fn version_parsing() {
        assert_eq!(version(&b"WARC/0.0\r\n"[..]), Ok((&b""[..], &"0.0"[..])));

        assert_eq!(version(&b"WARC/1.0\r\n"[..]), Ok((&b""[..], &"1.0"[..])));

        assert_eq!(
            version(&b"WARC/2.0-alpha\r\n"[..]),
            Ok((&b""[..], &"2.0-alpha"[..]))
        );
    }

    #[test]
    fn header_pair_parsing() {
        assert_eq!(
            header(&b"some-header: all/the/things\r\n"[..]),
            Ok((&b""[..], (&b"some-header"[..], &b"all/the/things"[..],)))
        );

        assert_eq!(
            header(&b"another-header : with extra spaces\r\n"[..]),
            Ok((
                &b""[..],
                (&b"another-header"[..], &b"with extra spaces"[..],)
            ))
        );

        assert_eq!(
            header(&b"incomplete-header : missing-line-ending"[..]),
            Err(Err::Incomplete(Needed::Unknown))
        );
    }

    #[test]
    fn headers_parsing() {
        let raw_invalid = b"\
            WARC/1.0\r\n\
            content-length: R2D2\r\n\
            that: is not\r\n\
            a-valid: content-length\r\n\
            \r\n\
        ";

        assert_eq!(
            headers(&raw_invalid[..]),
            Err(Err::Error((&b"\r\n"[..], ErrorKind::Verify)))
        );

        let raw = b"\
            WARC/1.0\r\n\
            content-length: 42\r\n\
            foo: is fantastic\r\n\
            bar: is beautiful\r\n\
            baz: is bananas\r\n\
            \r\n\
        ";
        let expected_version = "1.0";
        let expected_headers: Vec<(&str, &[u8])> = vec![
            ("content-length", b"42"),
            ("foo", b"is fantastic"),
            ("bar", b"is beautiful"),
            ("baz", b"is bananas"),
        ];
        let expected_len = 42;

        assert_eq!(
            headers(&raw[..]),
            Ok((
                &b"\r\n"[..],
                (expected_version, expected_headers, expected_len)
            ))
        );
    }

    #[test]
    fn parse_record() {
        let raw = b"\
            WARC/1.0\r\n\
            Warc-Type: dunno\r\n\
            Content-Length: 5\r\n\
            \r\n\
            12345\r\n\
            \r\n\
            WARC/1.0\r\n\
            Warc-Type: another\r\n\
            Content-Length: 6\r\n\
            \r\n\
            123456\r\n\
            \r\n\
        ";

        let expected_version = "1.0";
        let expected_headers: Vec<(&str, &[u8])> =
            vec![("Warc-Type", b"dunno"), ("Content-Length", b"5")];
        let expected_body: &[u8] = b"12345";

        assert_eq!(
            record(&raw[..]),
            Ok((
                &b"WARC/1.0\r\nWarc-Type: another\r\nContent-Length: 6\r\n\r\n123456\r\n\r\n"[..],
                (expected_version, expected_headers, expected_body)
            ))
        );
    }
}
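
The TODO in `headers` notes that a record without a content-length header is technically invalid. A stricter lookup could be sketched as follows, using only the standard library rather than nom. `find_content_length` is a hypothetical helper for illustration and is not part of the library; the error messages are assumptions as well.

```rust
/// Hypothetical helper: look up `content-length` case-insensitively and parse it.
/// Returns an error message when the header is missing or is not a number,
/// instead of silently falling back to 0.
pub fn find_content_length(headers: &[(&str, &[u8])]) -> Result<usize, String> {
    for (name, value) in headers {
        if name.eq_ignore_ascii_case("content-length") {
            // Header values are raw bytes, so they must be decoded first.
            let text = std::str::from_utf8(value)
                .map_err(|_| "content-length is not valid UTF-8".to_string())?;
            return text
                .trim()
                .parse::<usize>()
                .map_err(|_| format!("content-length is not a number: {:?}", text));
        }
    }
    // No matching header was found at all.
    Err("missing required header: content-length".to_string())
}
```

Whether the parser should reject such records or keep the current fallback to 0 is a design choice: rejecting is closer to the standard, while the fallback is more forgiving of malformed crawl data.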

truncated_type.rs


#[derive(Clone, Debug, PartialEq)]
pub enum TruncatedType {
    Length,
    Time,
    Disconnect,
    Unspecified,
    Unknown(String),
}

impl ToString for TruncatedType {
    fn to_string(&self) -> String {
        let stringified = match *self {
            TruncatedType::Length => "length",
            TruncatedType::Time => "time",
            TruncatedType::Disconnect => "disconnect",
            TruncatedType::Unspecified => "unspecified",
            TruncatedType::Unknown(ref val) => val.as_ref(),
        };
        stringified.to_string()
    }
}

impl<S: AsRef<str>> From<S> for TruncatedType {
    fn from(string: S) -> Self {
        let lower: String = string.as_ref().to_lowercase();
        match lower.as_str() {
            "length" => TruncatedType::Length,
            "time" => TruncatedType::Time,
            "disconnect" => TruncatedType::Disconnect,
            "unspecified" => TruncatedType::Unspecified,
            _ => TruncatedType::Unknown(lower),
        }
    }
}

error.rs


use std::error;
use std::fmt;

use crate::header::WarcHeader;

/// An error type returned by WARC header parsing.
#[derive(Clone, Debug, PartialEq)]
pub enum Error {
    /// An error occurred identifying or parsing headers.
    ParseHeaders,
    /// A header required by the standard is missing from the record. The record was well-formed,
    /// but invalid.
    MissingHeader(WarcHeader),
    /// A required header is not well-formed according to the standard.
    MalformedHeader(WarcHeader, String),
    /// The underlying read from the data source failed.
    ReadData,
    /// More data was read than expected by the header metadata. The record was well-formed, but
    /// invalid.
    ReadOverflow,
    /// The end of the record's body was found unexpectedly.
    UnexpectedEOB,
}

impl fmt::Display for Error {
    fn fmt(&self, f: &mut fmt::Formatter) -> fmt::Result {
        match *self {
            Error::ParseHeaders => write!(f, "Error parsing headers."),
            Error::MissingHeader(ref h) => write!(f, "Missing required header: {}", h.to_string()),
            Error::MalformedHeader(ref h, ref r) => {
                write!(f, "Malformed header: {}: {}", h.to_string(), r)
            }
            Error::ReadData => write!(f, "Error reading data source."),
            Error::ReadOverflow => write!(f, "Read further than expected."),
            Error::UnexpectedEOB => write!(f, "Unexpected end of body."),
        }
    }
}

impl error::Error for Error {}

Improvements/Tweaks which can be performed:

1. Reading the CDX files, which are index files for WARC files, gives the exact byte offset of a web page inside a WARC file. Using this offset, we do not need to decompress the whole WARC file; we can decompress only the portion of the file at that offset. This saves a lot of time and memory when looking up web pages.

2. The code can also read gzip-compressed WARC files (.warc.gz) and report which WARC files they contain.
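
The first improvement could be sketched roughly as follows. This assumes the offset and length of a record have already been obtained from a CDX entry (parsing the index is not shown). `read_record_bytes` and `looks_like_gzip` are hypothetical helper names, not part of the library. Only the standard library is used; the bytes returned would then be handed to a gzip decoder such as the libflate dependency.

```rust
use std::io::{self, Read, Seek, SeekFrom};

/// Hypothetical helper: read `length` bytes starting at byte `offset`,
/// as an index entry might describe them.
pub fn read_record_bytes<R: Read + Seek>(
    source: &mut R,
    offset: u64,
    length: u64,
) -> io::Result<Vec<u8>> {
    // Jump straight to the record instead of scanning from the start.
    source.seek(SeekFrom::Start(offset))?;

    let mut buf = Vec::new();
    // `take` caps the read at `length` bytes.
    source.by_ref().take(length).read_to_end(&mut buf)?;

    if (buf.len() as u64) < length {
        return Err(io::Error::new(
            io::ErrorKind::UnexpectedEof,
            "record is shorter than the index claims",
        ));
    }
    Ok(buf)
}

/// Gzip streams begin with the magic bytes 0x1f 0x8b.
pub fn looks_like_gzip(bytes: &[u8]) -> bool {
    bytes.starts_with(&[0x1f, 0x8b])
}
```

With a file opened through `std::fs::File::open`, calling `read_record_bytes(&mut file, offset, length)` would return only that record's bytes, and `looks_like_gzip` could be used to decide whether they need to be decompressed before parsing.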

References:

[1] Documentation of rust crate - warc: https://docs.rs/crate/warc/

[2] https://github.com/jedireza

[3] https://docs.rs/libflate/0.1.9/libflate/gzip/index.html

[4] https://iipc.github.io/warc-specifications/specifications/cdx-format/cdx-2015/#

[5] https://archive.org/download/CC-MAIN-2021-04-1610703512342.19-0022