Text Import Settings

From Eigenvector Research Documentation Wiki
imported>Jeremy
imported>Scott
 
Latest revision as of 16:45, 3 December 2013

The Text Import Settings window allows you to choose how text-formatted files are imported. These files are often comma-separated values (CSV) files or similar. The settings window is opened by selecting the "Delimited Text File" format from any of the "Import" menus.

General Format

Files of this format are arranged as rows of numbers separated into columns by some text delimiter (comma, space, tab). There are often one or more columns or rows of text labels which describe each column or row of values. There may also be some number of rows at the top of the file which give a description or other technical details about the data (header rows).

Example:

These are the header lines describing the file
There are two of them. The next line is the column labels
        , c1, c2, c3, c4, c5
sample A,  9,  7,  5,  3,  1
sample B,  8,  6,  4,  2,  0
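
As a rough illustration of how such a file breaks down into header lines, column labels, row labels, and data, the example above can be split apart with a few lines of Python. This is only a sketch using the standard `csv` module, not the importer's own algorithm; the function name `parse_example` and the fixed `header_rows=2` argument are assumptions made for this example.

```python
import csv

# The example file from above, held in a string for illustration.
EXAMPLE = """\
These are the header lines describing the file
There are two of them. The next line is the column labels
        , c1, c2, c3, c4, c5
sample A,  9,  7,  5,  3,  1
sample B,  8,  6,  4,  2,  0
"""

def parse_example(text, header_rows=2):
    """Split the text into header lines, column labels, row labels, and data."""
    lines = text.splitlines()
    header = lines[:header_rows]
    rows = list(csv.reader(lines[header_rows:], skipinitialspace=True))
    # The first remaining row holds the column labels; its first cell is the
    # empty corner above the row labels, so it is skipped.
    column_labels = [cell.strip() for cell in rows[0][1:]]
    # Every later row starts with one row label followed by numeric values.
    row_labels = [row[0].strip() for row in rows[1:]]
    data = [[float(cell) for cell in row[1:]] for row in rows[1:]]
    return header, column_labels, row_labels, data

header, column_labels, row_labels, data = parse_example(EXAMPLE)
print(column_labels)
print(row_labels)
print(data)
```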

In the most flexible parsing method, there can be row and column labels anywhere in the file (top/bottom/left/right).

Note that, if a file contains an axis scale (numerical values which represent reference values and not data), these will be imported as data and will have to be converted into an axis scale later. See the DataSet Editor Data Tab page. Alternatively, the XY Delimited Text Format can be used for importing data with an axis scale as the first column of a delimited text file.

Importing Options

After specifying which file to import, the Text Import Settings window appears:

[Figure: the Text Import Settings window]

By default, the importer will be set with options that work on many files, but changing the options may provide faster importing, reduce the amount of post-import changes you have to make, or fix potential importing problems. The following options are available:

Primary Parsing Options

  • Parsing : There are five options for parsing:
    • Automatic : the most flexible importing method. The file is automatically parsed for labels and header information. This works on many standard arrangements with different numbers of row and column labels. It may take some time to complete on larger files. See the note below regarding additional options available with 'automatic' parsing.
    • Automatic (strict,fast) : faster automatic parsing which does not handle header lines, and expects that all row labels will be on the left-hand side of the data and all column labels will be on the top of the columns. If this returns the wrong result or fails, try the 'automatic' parsing method.
    • Manual : user specifies how many row and column labels are present as well as header lines (see options below).
    • Automatic (stream) : nearly identical to 'Automatic' but reads from the file in pieces. This allows reading somewhat larger files than might otherwise be readable because of memory limitations.
    • Graphical Selection : opens a window where each row and column can be manually set as data, label, class, or axisscale.
  • Delimiter : Specifies the text character which separates the columns of data. If automatic is selected, the file is scanned for the most consistently used character. Otherwise, select the character used to separate the columns of data.
  • Comment Character : Specifies the character which, when appearing as the first character in a line, indicates that the line is a comment and not data.
    Example: #This is a comment - not data
    See also the "Header Rows" option.
  • Header Rows : Indicates the number of rows at the top of the file which are not data and should be read as comments only.
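
The delimiter, comment character, and header row options can be pictured with the short Python sketch below. It is not the importer's actual algorithm: the candidate delimiter list, the scoring rule (the fraction of lines that share the most common non-zero delimiter count), and the names `strip_non_data` and `guess_delimiter` are all assumptions chosen for illustration.

```python
from collections import Counter

CANDIDATES = [",", "\t", ";", " "]  # hypothetical set of delimiters to try

def strip_non_data(lines, comment_char="#", header_rows=0):
    """Drop the leading header rows, then blank lines and comment lines."""
    body = lines[header_rows:]
    return [ln for ln in body if ln.strip() and not ln.startswith(comment_char)]

def guess_delimiter(lines, candidates=CANDIDATES):
    """Return the candidate that is used most consistently from line to line."""
    best, best_score = None, -1.0
    for delim in candidates:
        counts = [ln.count(delim) for ln in lines]
        if not counts:
            continue
        # Most common per-line count and the number of lines that have it.
        mode_count, n_lines = Counter(counts).most_common(1)[0]
        if mode_count == 0:
            continue  # this character is absent from most lines
        score = n_lines / len(counts)
        if score > best_score:  # ties keep the earlier candidate
            best, best_score = delim, score
    return best

raw = [
    "File exported by instrument",    # one header row
    "#This is a comment - not data",  # comment line
    "a;b;c",
    "1,5;2,5;3,5",
    "4,0;5,0;6,0",
]
data_lines = strip_non_data(raw, comment_char="#", header_rows=1)
delimiter = guess_delimiter(data_lines)
print(data_lines)
print(repr(delimiter))
```

In this made-up file the comma appears in only two of the three remaining lines, while the semicolon appears twice in every line, so the semicolon scores as the more consistent separator.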

Manual Parsing Options

These options are only available when parsing is "Manual".

  • Row Labels : Indicates how many rows of column-labels appear at the top of the file. With manual parsing, all column labels must be at the top of the file (first rows).
  • Column Labels : Indicates how many columns of row-labels appear at the left of the file. With manual parsing, all row labels must be at the left-hand side of the file (first columns).
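
To make the manual settings concrete, the following Python sketch splits a small file using explicit counts. The function `manual_parse` and its parameter names are hypothetical: `label_rows` plays the role of the "Row Labels" setting (rows of column labels at the top) and `label_cols` the role of the "Column Labels" setting (columns of row labels at the left). It assumes a rectangular file with a single-character delimiter.

```python
def manual_parse(lines, header_rows, label_rows, label_cols, delimiter=","):
    """Split lines into header text, column labels, row labels, and numeric data.

    label_rows: number of rows of column labels at the top (after the header).
    label_cols: number of columns of row labels at the left.
    """
    header = lines[:header_rows]
    table = [[cell.strip() for cell in ln.split(delimiter)]
             for ln in lines[header_rows:]]
    # Each label row contributes one set of column labels (corner cells skipped).
    column_labels = [row[label_cols:] for row in table[:label_rows]]
    body = table[label_rows:]
    # Each label column contributes one set of row labels.
    row_labels = [[row[i] for row in body] for i in range(label_cols)]
    data = [[float(cell) for cell in row[label_cols:]] for row in body]
    return header, column_labels, row_labels, data

lines = [
    "Header line",
    ",,c1,c2",
    ",,mg,mL",
    "s1,A,1.0,2.0",
    "s2,B,3.0,4.0",
]
header, column_labels, row_labels, data = manual_parse(
    lines, header_rows=1, label_rows=2, label_cols=2)
print(column_labels)
print(row_labels)
print(data)
```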

Other Options

  • EU Format : When checked, the parser expects decimal values to be indicated with a comma rather than a period. Note that when checked, the delimiter setting (above) can NOT be a comma.
  • Treat consecutive delimiters as one : When checked, delimiters which appear in succession without any other text between them are treated as a single delimiter, so 1,,2,,3 is read as 1,2,3. Otherwise, each empty element is represented by the placeholder value NaN ("not a number") in the parsed data, so 1,,2,,3 is read as 1,NaN,2,NaN,3.
  • Hard-delete empty columns and rows : When checked, any rows or columns of data which are completely empty (i.e. all values are NaN as explained above) are removed from the imported DataSet. If a column or row has any non-NaN value, it will not be removed.
  • Transpose : When checked, the imported DataSet will have columns which represent the rows of the imported file (and rows of the DataSet will be columns of the file). This is useful when a file stores samples as columns, because other Solo and PLS_Toolbox operations expect samples to be rows.
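
The effect of these four options can be sketched in Python as follows. This is an illustrative approximation of the behavior described above, not the importer's code; every function name here (`split_fields`, `to_number`, `parse_line`, `drop_empty`, `transpose`) is an assumption, and the data is assumed to be rectangular.

```python
import math
import re

def split_fields(line, delimiter=",", merge_consecutive=False):
    """Split a line, optionally treating runs of the delimiter as one."""
    if merge_consecutive:
        return re.split(re.escape(delimiter) + "+", line)
    return line.split(delimiter)

def to_number(field, eu_format=False):
    """Convert one field to float; an empty field becomes NaN."""
    field = field.strip()
    if field == "":
        return math.nan
    if eu_format:
        field = field.replace(",", ".")  # decimal comma -> decimal point
    return float(field)

def parse_line(line, delimiter=",", merge_consecutive=False, eu_format=False):
    if eu_format and delimiter == ",":
        raise ValueError("comma cannot be both decimal mark and delimiter")
    fields = split_fields(line, delimiter, merge_consecutive)
    return [to_number(f, eu_format) for f in fields]

def drop_empty(data):
    """Remove rows and columns in which every value is NaN."""
    rows = [r for r in data if not all(math.isnan(v) for v in r)]
    if not rows:
        return []
    keep = [j for j in range(len(rows[0]))
            if not all(math.isnan(r[j]) for r in rows)]
    return [[r[j] for j in keep] for r in rows]

def transpose(data):
    """Swap rows and columns."""
    return [list(col) for col in zip(*data)]

with_nan = parse_line("1,,2,,3")                        # empty fields kept as NaN
merged = parse_line("1,,2,,3", merge_consecutive=True)  # runs collapsed
eu = parse_line("1,5;2,5", delimiter=";", eu_format=True)

nan = math.nan
cleaned = drop_empty([[1.0, nan, 2.0], [nan, nan, nan], [3.0, nan, 4.0]])
flipped = transpose(cleaned)
print(with_nan, merged, eu, cleaned, flipped)
```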

Command-Line Options

PLS_Toolbox users can access many of these options and more through the xclreadr function. This function also accepts options like those passed to the parsemixed function.

XY Delimited Text Format

The XY... Delimited Text format is very similar to the delimited text importer, with the following differences:

  • Columns of the file are assumed to be samples (thus, the data is always transposed after import).
  • The first column of the data is assumed to be numerical values which should be used as an axisscale.
  • No labels are permitted for the rows.

This is available to PLS_Toolbox users through the xyreadr function.
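
The layout described above can be pictured with a short Python sketch. It does not reproduce xyreadr; the function `read_xy` and the sample values are hypothetical, and the sketch assumes a purely numeric file with no label or header lines.

```python
def read_xy(lines, delimiter=","):
    """First column -> axis scale; each remaining column -> one sample (row)."""
    table = [[float(c) for c in ln.split(delimiter)]
             for ln in lines if ln.strip()]
    axisscale = [row[0] for row in table]
    # Transpose the remaining columns so that each sample becomes a row.
    samples = [list(col) for col in zip(*(row[1:] for row in table))]
    return axisscale, samples

xy_lines = [
    "400, 0.10, 0.20",
    "410, 0.15, 0.25",
    "420, 0.30, 0.40",
]
axisscale, samples = read_xy(xy_lines)
print(axisscale)
print(samples)
```

Here the three file rows become three variables, and the two value columns become two sample rows that share the axis scale taken from the first column.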