Content parsers

Parsers transform raw response bodies into readable text. ParserManager picks the right parser from the Content-Type header; each concrete parser handles one media type.

ParserManager

Maps Content-Type values to the appropriate Parser.

The registry performs a case-insensitive substring match, so "text/html; charset=utf-8" correctly resolves to HTMLParser.

Built-in registry:

| Content-Type | Parser |
|---|---|
| text/html | HTMLParser |
| application/json | JSONParser |
| text/plain | PlainTextParser |

Example

>>> from go2web.http.parsers.parser_manager import ParserManager
>>> manager = ParserManager()
>>> parser = manager.get_parser("application/json; charset=utf-8")
>>> parser.parse('{"key": "value"}')
'{\n  "key": "value"\n}'
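
The matching rule can be tried out without the package installed. The sketch below mirrors the lookup loop from the source on a plain dictionary; `REGISTRY`, `resolve`, and the string labels are hypothetical stand-ins for illustration, not part of go2web.

```python
from typing import Optional

# Stand-in registry: media type -> parser name (labels only, for illustration).
REGISTRY = {
    "text/html": "HTMLParser",
    "application/json": "JSONParser",
    "text/plain": "PlainTextParser",
}


def resolve(content_type: str) -> Optional[str]:
    """Return the first registry entry whose key occurs in content_type."""
    header = content_type.lower()
    for media_type, name in REGISTRY.items():
        if media_type.lower() in header:
            return name
    return None  # the real manager raises ParseError at this point


print(resolve("TEXT/HTML; charset=UTF-8"))         # HTMLParser
print(resolve("application/json; charset=utf-8"))  # JSONParser
print(resolve("application/ld+json"))              # None
print(resolve("image/png"))                        # None
```

Because the test is a substring check, parameters such as charset are ignored, while a structured-syntax type such as application/ld+json does not contain the application/json key and therefore finds no parser.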

Source code in src/go2web/http/parsers/parser_manager.py
class ParserManager:
    """Maps ``Content-Type`` values to the appropriate :class:`~go2web.http.parsers.abstract_parser.Parser`.

    The registry performs a case-insensitive substring match, so
    ``"text/html; charset=utf-8"`` correctly resolves to :class:`~go2web.http.parsers.html_parser.HTMLParser`.

    Built-in registry:

    | Content-Type | Parser |
    |---|---|
    | ``text/html`` | :class:`~go2web.http.parsers.html_parser.HTMLParser` |
    | ``application/json`` | :class:`~go2web.http.parsers.json_parser.JSONParser` |
    | ``text/plain`` | :class:`~go2web.http.parsers.plain_text_parser.PlainTextParser` |

    Example:
        >>> from go2web.http.parsers.parser_manager import ParserManager
        >>> manager = ParserManager()
        >>> parser = manager.get_parser("application/json; charset=utf-8")
        >>> parser.parse('{"key": "value"}')
        '{\\n  "key": "value"\\n}'
    """

    _REGISTRY: dict[str, Parser] = {
        "text/html": HTMLParser(),
        "application/json": JSONParser(),
        "text/plain": PlainTextParser(),
    }

    def get_parser(self, content_type: str) -> Parser:
        """Return the parser for *content_type*.

        Args:
            content_type: The value of the ``Content-Type`` response header,
                e.g. ``"text/html; charset=utf-8"``.

        Returns:
            A :class:`~go2web.http.parsers.abstract_parser.Parser` instance.

        Raises:
            ~go2web.http.parsers.exceptions.ParseError: When no parser is
                registered for *content_type*.
        """
        for ct, parser in self._REGISTRY.items():
            if ct.lower() in content_type.lower():
                return parser
        raise ParseError(f"Error: Can't find parser for: {content_type} (for now)")

get_parser(content_type)

Return the parser for content_type.

Parameters:

Name Type Description Default
content_type str

The value of the Content-Type response header, e.g. "text/html; charset=utf-8".

required

Returns:

Type Description
Parser

A Parser instance.

Raises:

Type Description
ParseError

When no parser is registered for content_type.

Source code in src/go2web/http/parsers/parser_manager.py
def get_parser(self, content_type: str) -> Parser:
    """Return the parser for *content_type*.

    Args:
        content_type: The value of the ``Content-Type`` response header,
            e.g. ``"text/html; charset=utf-8"``.

    Returns:
        A :class:`~go2web.http.parsers.abstract_parser.Parser` instance.

    Raises:
        ~go2web.http.parsers.exceptions.ParseError: When no parser is
            registered for *content_type*.
    """
    for ct, parser in self._REGISTRY.items():
        if ct.lower() in content_type.lower():
            return parser
    raise ParseError(f"Error: Can't find parser for: {content_type} (for now)")

HTMLParser

Bases: Parser

Strip HTML markup and return readable plain text.

Non-content tags (<script>, <style>, <noscript>, <head>) are removed entirely before text extraction so that rendered output contains only visible page content.

Example

>>> parser = HTMLParser()
>>> parser.parse("<html><body><h1>Hello</h1><p>World</p></body></html>")
'Hello\nWorld'

Source code in src/go2web/http/parsers/html_parser.py
class HTMLParser(AbstractParser):
    """Strip HTML markup and return readable plain text.

    Non-content tags (``<script>``, ``<style>``, ``<noscript>``, ``<head>``)
    are removed entirely before text extraction so that rendered output
    contains only visible page content.

    Example:
        >>> parser = HTMLParser()
        >>> parser.parse("<html><body><h1>Hello</h1><p>World</p></body></html>")
        'Hello\\nWorld'
    """

    def parse(self, body: str) -> str:
        """Extract visible text from an HTML document.

        Args:
            body: Raw HTML string.

        Returns:
            Whitespace-normalised plain text with newline separators.
        """
        soup = BeautifulSoup(body, "html.parser")

        # drop non-content tags entirely
        for tag in soup(["script", "style", "noscript", "head"]):
            tag.decompose()

        text = soup.get_text(separator="\n", strip=True)

        return text.strip()

parse(body)

Extract visible text from an HTML document.

Parameters:

Name Type Description Default
body str

Raw HTML string.

required

Returns:

Type Description
str

Whitespace-normalised plain text with newline separators.
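
The source relies on BeautifulSoup for this step. As a dependency-free illustration of the same idea (ignore everything nested inside non-content tags, then join the remaining text with newlines), the sketch below swaps in the standard-library html.parser module. `VisibleTextExtractor` and `visible_text` are hypothetical names; this is not the package's implementation and it assumes well-nested markup.

```python
from html.parser import HTMLParser as StdlibHTMLParser

SKIPPED_TAGS = {"script", "style", "noscript", "head"}


class VisibleTextExtractor(StdlibHTMLParser):
    """Collect text nodes that are not nested inside a skipped tag."""

    def __init__(self) -> None:
        super().__init__()
        self._skip_depth = 0  # greater than 0 while inside a skipped tag
        self.chunks = []

    def handle_starttag(self, tag, attrs):
        if tag in SKIPPED_TAGS:
            self._skip_depth += 1

    def handle_endtag(self, tag):
        if tag in SKIPPED_TAGS and self._skip_depth > 0:
            self._skip_depth -= 1

    def handle_data(self, data):
        text = data.strip()
        if text and self._skip_depth == 0:
            self.chunks.append(text)


def visible_text(body: str) -> str:
    """Return the visible text of body, one text node per line."""
    extractor = VisibleTextExtractor()
    extractor.feed(body)
    extractor.close()
    return "\n".join(extractor.chunks)


html = (
    "<html><head><title>Ignored</title><style>p {color: red}</style></head>"
    "<body><h1>Hello</h1><script>alert(1)</script><p>World</p></body></html>"
)
print(visible_text(html))
```

The title, style rule, and script body are all dropped, so only the heading and paragraph text remain, each on its own line.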

Source code in src/go2web/http/parsers/html_parser.py
def parse(self, body: str) -> str:
    """Extract visible text from an HTML document.

    Args:
        body: Raw HTML string.

    Returns:
        Whitespace-normalised plain text with newline separators.
    """
    soup = BeautifulSoup(body, "html.parser")

    # drop non-content tags entirely
    for tag in soup(["script", "style", "noscript", "head"]):
        tag.decompose()

    text = soup.get_text(separator="\n", strip=True)

    return text.strip()

JSONParser

Bases: Parser

Parse a JSON response body and return it pretty-printed.

Example

>>> parser = JSONParser()
>>> parser.parse('{"name":"go2web","version":"0.1.0"}')
'{\n  "name": "go2web",\n  "version": "0.1.0"\n}'

Source code in src/go2web/http/parsers/json_parser.py
class JSONParser(AbstractParser):
    """Parse a JSON response body and return it pretty-printed.

    Example:
        >>> parser = JSONParser()
        >>> parser.parse('{"name":"go2web","version":"0.1.0"}')
        '{\\n  "name": "go2web",\\n  "version": "0.1.0"\\n}'
    """

    def parse(self, body: str) -> str:
        """Deserialise *body* and return it indented with two spaces.

        Args:
            body: A JSON-encoded string.

        Returns:
            Pretty-printed JSON with ``ensure_ascii=False`` so Unicode
            characters are preserved.

        Raises:
            ~go2web.http.parsers.exceptions.ParseError: When *body* is not
                valid JSON.
        """
        try:
            data = json.loads(body)
            return json.dumps(data, indent=2, ensure_ascii=False)
        except json.JSONDecodeError as e:
            raise ParseError("Error: Bad JSON or content encoding") from e

parse(body)

Deserialise body and return it indented with two spaces.

Parameters:

Name Type Description Default
body str

A JSON-encoded string.

required

Returns:

Type Description
str

Pretty-printed JSON with ensure_ascii=False so Unicode characters are preserved.

Raises:

Type Description
ParseError

When body is not valid JSON.
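
To make the effect of ensure_ascii=False concrete, the following standard-library sketch runs the same json.dumps call as the source next to the default setting. The sample body is made up for illustration.

```python
import json

body = '{"name":"café","tags":["a","b"]}'
data = json.loads(body)

# Same call as in the source: two-space indent, non-ASCII kept verbatim.
pretty = json.dumps(data, indent=2, ensure_ascii=False)
print(pretty)

# With the default ensure_ascii=True, non-ASCII characters are escaped.
escaped = json.dumps(data)
print(escaped)
```

The first call keeps "café" readable in the terminal, while the second emits it as "caf\u00e9".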

Source code in src/go2web/http/parsers/json_parser.py
def parse(self, body: str) -> str:
    """Deserialise *body* and return it indented with two spaces.

    Args:
        body: A JSON-encoded string.

    Returns:
        Pretty-printed JSON with ``ensure_ascii=False`` so Unicode
        characters are preserved.

    Raises:
        ~go2web.http.parsers.exceptions.ParseError: When *body* is not
            valid JSON.
    """
    try:
        data = json.loads(body)
        return json.dumps(data, indent=2, ensure_ascii=False)
    except json.JSONDecodeError as e:
        raise ParseError("Error: Bad JSON or content encoding") from e

PlainTextParser

Bases: Parser

Pass plain-text bodies through with leading/trailing whitespace stripped.

Example

>>> parser = PlainTextParser()
>>> parser.parse("  hello world  ")
'hello world'

Source code in src/go2web/http/parsers/plain_text_parser.py
class PlainTextParser(AbstractParser):
    """Pass plain-text bodies through with leading/trailing whitespace stripped.

    Example:
        >>> parser = PlainTextParser()
        >>> parser.parse("  hello world  ")
        'hello world'
    """

    def parse(self, body: str) -> str:
        """Return *body* stripped of surrounding whitespace.

        Args:
            body: A plain-text response body.

        Returns:
            The body string with leading and trailing whitespace removed.
        """
        return body.strip()

parse(body)

Return body stripped of surrounding whitespace.

Parameters:

Name Type Description Default
body str

A plain-text response body.

required

Returns:

Type Description
str

The body string with leading and trailing whitespace removed.

Source code in src/go2web/http/parsers/plain_text_parser.py
def parse(self, body: str) -> str:
    """Return *body* stripped of surrounding whitespace.

    Args:
        body: A plain-text response body.

    Returns:
        The body string with leading and trailing whitespace removed.
    """
    return body.strip()

Parser (abstract)

Bases: ABC

Abstract base class for response body parsers.

Subclasses implement parse to transform a raw response body string into a human-readable representation suitable for terminal output.

To add a new content type, subclass Parser and register the instance in ParserManager.
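
A possible shape for such an extension is sketched below. The classes are reduced local stand-ins so the snippet runs on its own; `CSVParser`, `ExtendedParserManager`, and registering through a subclassed `_REGISTRY` are assumptions, since the source shows no public registration method.

```python
from abc import ABC, abstractmethod
import csv
import io


class ParseError(Exception):
    """Stand-in for go2web.http.parsers.exceptions.ParseError."""


class Parser(ABC):
    """Stand-in for go2web's abstract Parser."""

    @abstractmethod
    def parse(self, body: str) -> str: ...


class ParserManager:
    """Reduced stand-in: same lookup loop, empty built-in registry."""

    _REGISTRY: dict[str, Parser] = {}

    def get_parser(self, content_type: str) -> Parser:
        for ct, parser in self._REGISTRY.items():
            if ct.lower() in content_type.lower():
                return parser
        raise ParseError(f"no parser registered for {content_type}")


class CSVParser(Parser):
    """Hypothetical parser that renders CSV rows as pipe-separated lines."""

    def parse(self, body: str) -> str:
        rows = csv.reader(io.StringIO(body.strip()))
        return "\n".join(" | ".join(cell.strip() for cell in row) for row in rows)


class ExtendedParserManager(ParserManager):
    # Copy the inherited mapping so the base class registry is left untouched.
    _REGISTRY = {**ParserManager._REGISTRY, "text/csv": CSVParser()}


parser = ExtendedParserManager().get_parser("text/csv; charset=utf-8")
print(parser.parse("name,version\ngo2web,0.1.0\n"))
```

Building a new mapping in the subclass, rather than mutating the shared class-level dictionary, keeps the change local to callers that opt into the extended manager.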

Source code in src/go2web/http/parsers/abstract_parser.py
class Parser(ABC):
    """Abstract base class for response body parsers.

    Subclasses implement :meth:`parse` to transform a raw response body string
    into a human-readable representation suitable for terminal output.

    To add a new content type, subclass :class:`Parser` and register the
    instance in :class:`~go2web.http.parsers.parser_manager.ParserManager`.
    """

    @abstractmethod
    def parse(self, body: str) -> str:
        """Parse *body* and return a human-readable string.

        Args:
            body: The raw response body decoded from bytes.

        Returns:
            A cleaned, human-readable representation of *body*.

        Raises:
            ~go2web.http.parsers.exceptions.ParseError: When the body cannot
                be interpreted as the expected format.
        """
        ...

parse(body) abstractmethod

Parse body and return a human-readable string.

Parameters:

Name Type Description Default
body str

The raw response body decoded from bytes.

required

Returns:

Type Description
str

A cleaned, human-readable representation of body.

Raises:

Type Description
ParseError

When the body cannot be interpreted as the expected format.

Source code in src/go2web/http/parsers/abstract_parser.py
@abstractmethod
def parse(self, body: str) -> str:
    """Parse *body* and return a human-readable string.

    Args:
        body: The raw response body decoded from bytes.

    Returns:
        A cleaned, human-readable representation of *body*.

    Raises:
        ~go2web.http.parsers.exceptions.ParseError: When the body cannot
            be interpreted as the expected format.
    """
    ...

ParseError

Bases: Exception

Raised when a parser cannot interpret a response body.

Typically caught by Fetcher and displayed as a styled error panel in the terminal.
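
How a caller might handle the exception is sketched below. The Fetcher itself is not shown in this section, so `render` is a hypothetical caller and `parse_json` a stand-in parser; the one-line notice stands in for the styled panel.

```python
import json


class ParseError(Exception):
    """Stand-in for go2web.http.parsers.exceptions.ParseError."""


def parse_json(body: str) -> str:
    """Stand-in parser that wraps decoder failures in ParseError."""
    try:
        return json.dumps(json.loads(body), indent=2, ensure_ascii=False)
    except json.JSONDecodeError as exc:
        raise ParseError("body is not valid JSON") from exc


def render(body: str) -> str:
    """Hypothetical caller: return parsed output, or a one-line error notice."""
    try:
        return parse_json(body)
    except ParseError as err:
        return f"[parse error] {err}"


print(render('{"ok": true}'))
print(render("<html>"))
```

Raising with `from exc` keeps the original decoder error reachable through `__cause__`, which can help when the short message alone is not enough to diagnose a bad response.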

Source code in src/go2web/http/parsers/exceptions.py
class ParseError(Exception):
    """Raised when a parser cannot interpret a response body.

    Typically caught by :class:`~go2web.commands.fetch.Fetcher` and displayed
    as a styled error panel in the terminal.
    """