Class Tokenizer

java.lang.Object
mars.assembler.token.Tokenizer

public class Tokenizer extends Object
A tokenizer is capable of tokenizing a complete MIPS program, or a given line from a MIPS program. Since MIPS is line-oriented, each line defines a complete statement. Tokenizing is the process of scanning the input MIPS program to recognize each MIPS language element. These elements are known as "tokens", and their types are defined in the TokenType class.

Example:

here:  lw  $t3, 8($t4)   #load third member of array
The above is tokenized as IDENTIFIER, COLON, OPERATOR, REGISTER_NAME, COMMA, INTEGER_5, LEFT_PAREN, REGISTER_NAME, RIGHT_PAREN, COMMENT.
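
The classification above can be illustrated with a minimal, self-contained sketch. This is not the MARS implementation: the Kind enum (a stand-in for a subset of TokenType), the MNEMONICS set, the OTHER fallback, the rule that any word starting with '$' is a register name, and the rule that integers from 0 to 31 are INTEGER_5 are all assumptions made for illustration.

```java
import java.math.BigInteger;
import java.util.ArrayList;
import java.util.List;
import java.util.Set;

final class LineTokenizerSketch {
    // Hypothetical stand-in for a subset of TokenType, plus an OTHER fallback.
    enum Kind { IDENTIFIER, COLON, OPERATOR, REGISTER_NAME, COMMA, INTEGER_5,
                LEFT_PAREN, RIGHT_PAREN, COMMENT, OTHER }

    // Assumption: a tiny mnemonic table; a real assembler has a full instruction set.
    private static final Set<String> MNEMONICS = Set.of("lw", "sw", "add", "addi");

    static List<Kind> tokenize(String line) {
        List<Kind> kinds = new ArrayList<>();
        int i = 0;
        while (i < line.length()) {
            char ch = line.charAt(i);
            if (Character.isWhitespace(ch)) { i++; continue; }
            if (ch == '#') { kinds.add(Kind.COMMENT); break; } // comment runs to end of line
            Kind punct = punctuation(ch);
            if (punct != null) { kinds.add(punct); i++; continue; }
            int start = i;
            while (i < line.length() && isWordChar(line.charAt(i))) i++;
            if (i == start) { kinds.add(Kind.OTHER); i++; continue; } // unknown character
            kinds.add(classify(line.substring(start, i)));
        }
        return kinds;
    }

    private static Kind punctuation(char ch) {
        switch (ch) {
            case ':': return Kind.COLON;
            case ',': return Kind.COMMA;
            case '(': return Kind.LEFT_PAREN;
            case ')': return Kind.RIGHT_PAREN;
            default:  return null;
        }
    }

    private static boolean isWordChar(char ch) {
        return (ch >= 'a' && ch <= 'z') || (ch >= 'A' && ch <= 'Z')
                || (ch >= '0' && ch <= '9') || ch == '_' || ch == '.' || ch == '$';
    }

    private static Kind classify(String word) {
        // Assumption: every '$'-prefixed word is treated as a register name.
        if (word.length() > 1 && word.charAt(0) == '$') return Kind.REGISTER_NAME;
        if (word.matches("[0-9]+")) {
            // Assumption: INTEGER_5 means an unsigned value that fits in 5 bits.
            boolean fits = new BigInteger(word).compareTo(BigInteger.valueOf(31)) <= 0;
            return fits ? Kind.INTEGER_5 : Kind.OTHER;
        }
        return MNEMONICS.contains(word) ? Kind.OPERATOR : Kind.IDENTIFIER;
    }

    public static void main(String[] args) {
        System.out.println(tokenize("here:  lw  $t3, 8($t4)   #load third member of array"));
    }
}
```

Running main prints the ten token kinds in the same order as the list above. The sketch reports only token kinds; it does not record source text or column positions.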

The original MARS tokenizer was written by Pete Sanderson in August 2003.

Author:
Pete Sanderson, August 2003; Sean Clarke, July 2024
  • Constructor Details

    • Tokenizer

      public Tokenizer()
  • Method Details

    • tokenizeFile

      public static SourceFile tokenizeFile(String filename, AssemblerLog log)
      Tokenize a complete MIPS program from a file, line by line. Each line of source code is translated into a SourceLine, which consists of both the original code and its tokenized form.

      Note: Equivalences, includes, and macros are handled by the Preprocessor at this stage.

      Parameters:
      filename - The name of the file containing source code to be tokenized.
      log - The error list, which will be populated with any tokenizing errors in the given file.
      Returns:
      The tokenized source file.
    • tokenizeFile

      public static SourceFile tokenizeFile(String filename, AssemblerLog log, Preprocessor preprocessor)
    • tokenizeLines

      public static SourceFile tokenizeLines(String filename, List<String> lines, AssemblerLog log)
      Tokenize a complete MIPS program, line by line. Each line of source code is translated into a SourceLine, which consists of both the original code and its tokenized form.

      Note: Equivalences, includes, and macros are handled by the Preprocessor at this stage.

      Parameters:
      filename - The filename indicating where the given source code is from.
      lines - The source code to be tokenized.
      log - The error list, which will be populated with any tokenizing errors in the given lines.
      Returns:
      The tokenized source file.
    • tokenizeLines

      public static SourceFile tokenizeLines(String filename, List<String> lines, AssemblerLog log, boolean isInExpansionTemplate)
    • tokenizeLines

      public static SourceFile tokenizeLines(String filename, List<String> lines, AssemblerLog log, boolean isInExpansionTemplate, Preprocessor preprocessor)
    • tokenizeLine

      public static SourceLine tokenizeLine(String filename, String line, int lineIndex, AssemblerLog log, Preprocessor preprocessor)
      Tokenize one line of source code. If lexical errors are discovered, they are added to the given error list rather than being thrown as exceptions.
      Parameters:
      filename - The filename indicating where the given source code is from.
      line - The content of the line to be tokenized.
      lineIndex - The line index in the source file (for error reporting).
      log - The error list, which will be populated with any tokenizing errors in the given line.
      preprocessor - The current preprocessor instance, which will process token substitutions.
      Returns:
      The generated tokens for the given line.
    • tokenizeLine

      public static SourceLine tokenizeLine(String filename, String line, int lineIndex, AssemblerLog log, Preprocessor preprocessor, boolean isInExpansionTemplate)
    • isValidIdentifier

      public static boolean isValidIdentifier(String value)
      Determine whether the given string is a valid identifier. From COD2, A-51: "Identifiers are a sequence of alphanumeric characters, underbars (_), and dots (.) that do not begin with a number."

      DPS 14-Jul-2008: Added '$' as a valid identifier character, which permits labels to include '$'. MIPS-target GCC produces labels that start with '$'.
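
      The rule quoted above, extended with '$', can be sketched as follows. This is an illustration rather than the MARS source: the class name, the ASCII-only character test, and the rejection of null and empty strings are assumptions.

```java
final class IdentifierSketch {
    /**
     * Applies the quoted rule: letters, digits, '_', '.', and '$' only,
     * and the first character must not be a digit.
     */
    static boolean isValidIdentifier(String value) {
        // Assumption: null and empty strings are not identifiers.
        if (value == null || value.isEmpty()) return false;
        if (isAsciiDigit(value.charAt(0))) return false;
        for (int i = 0; i < value.length(); i++) {
            char ch = value.charAt(i);
            boolean allowed = isAsciiLetter(ch) || isAsciiDigit(ch)
                    || ch == '_' || ch == '.' || ch == '$';
            if (!allowed) return false;
        }
        return true;
    }

    private static boolean isAsciiDigit(char ch) {
        return ch >= '0' && ch <= '9';
    }

    private static boolean isAsciiLetter(char ch) {
        return (ch >= 'a' && ch <= 'z') || (ch >= 'A' && ch <= 'Z');
    }

    public static void main(String[] args) {
        System.out.println(isValidIdentifier("loop_1")); // true
        System.out.println(isValidIdentifier("$L4"));    // true
        System.out.println(isValidIdentifier("8ball"));  // false
        System.out.println(isValidIdentifier("a-b"));    // false
    }
}
```

      The sketch checks only the character rule; whether a name collides with an instruction mnemonic, register name, or directive is outside its scope.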

    • handleCharEscape

      public static int handleCharEscape(StringBuilder value, SourceLocation lineLocation, int columnIndex, String line, AssemblerLog log)
      Handle an escape for a character or string literal. It is assumed that columnIndex has already been incremented past the initial backslash.
      Parameters:
      value - The destination where the resulting character value will be appended.
      lineLocation - The location of the current line in the source file (for error reporting).
      columnIndex - The index in line of the character immediately following the initial backslash.
      line - The raw form of the current line of source code.
      log - The error list, which will be added to in case of an invalid character escape.
      Returns:
      The new value of columnIndex, which corresponds to the next character in the line.
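
      The index contract described above (the caller passes the position just after the backslash and receives the position of the next unread character) can be sketched as follows. This is a simplified illustration, not the MARS code: the set of recognized escapes is an assumption, and a plain List<String> stands in for AssemblerLog and SourceLocation.

```java
import java.util.ArrayList;
import java.util.List;

final class CharEscapeSketch {
    /**
     * Appends the character denoted by the escape at columnIndex to value.
     * Problems are recorded in errors rather than thrown.
     * Returns the index of the next character to read.
     */
    static int handleCharEscape(StringBuilder value, int columnIndex,
                                String line, List<String> errors) {
        if (columnIndex >= line.length()) {
            errors.add("column " + columnIndex + ": incomplete escape at end of line");
            return columnIndex;
        }
        char ch = line.charAt(columnIndex);
        switch (ch) {
            // Assumption: only these single-character escapes are supported here.
            case 'n': value.append('\n'); break;
            case 't': value.append('\t'); break;
            case '0': value.append('\0'); break;
            case '\\':
            case '\'':
            case '"': value.append(ch); break;
            default:
                errors.add("column " + columnIndex + ": unrecognized escape '\\" + ch + "'");
                break;
        }
        return columnIndex + 1; // consume the escape character
    }

    public static void main(String[] args) {
        String line = "\"a\\tb\""; // source text: "a\tb" including the quotes
        StringBuilder value = new StringBuilder();
        List<String> errors = new ArrayList<>();
        // Index 2 holds the backslash, so the escape character 't' is at index 3.
        int next = handleCharEscape(value, 3, line, errors);
        System.out.println(next);            // 4
        System.out.println(errors.isEmpty()); // true
    }
}
```

      Returning the next index lets the caller continue scanning from the right position even when an escape spans more than one character, which a fuller implementation (for example, with hexadecimal escapes) would need.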
    • hexadecimalDigitValue

      public static int hexadecimalDigitValue(int ch)
      Interpret the given character as a hexadecimal digit, if possible. For digits A through F, both uppercase and lowercase are accepted.
      Parameters:
      ch - The character to interpret as a hexadecimal digit.
      Returns:
      The hexadecimal digit value in the range [0, 16), or -1 if ch is not a valid hexadecimal digit.
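
      A minimal sketch of the behavior described above, written independently of the MARS source (the class name is hypothetical):

```java
final class HexDigitSketch {
    /** Returns the value of ch as a hexadecimal digit (0 to 15), or -1 if it is not one. */
    static int hexadecimalDigitValue(int ch) {
        if (ch >= '0' && ch <= '9') return ch - '0';
        if (ch >= 'a' && ch <= 'f') return ch - 'a' + 10;
        if (ch >= 'A' && ch <= 'F') return ch - 'A' + 10;
        return -1;
    }

    public static void main(String[] args) {
        System.out.println(hexadecimalDigitValue('7')); // 7
        System.out.println(hexadecimalDigitValue('c')); // 12
        System.out.println(hexadecimalDigitValue('F')); // 15
        System.out.println(hexadecimalDigitValue('g')); // -1
    }
}
```

      Taking an int rather than a char allows a caller to pass a sentinel such as -1 for end of input and still receive -1 back.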