Hello, I have a text file that contains text stripped from a PDF document. This text contains non-ascii characters that I have to remove before I can run it through some text-mining software.
I have looked at the ord function as a way to remove characters whose values fall outside the basic ASCII table, but I am not sure how to apply it to the whole text file. I thought of reading the file line by line and checking each character in turn. I have also looked at previous threads on text cleaning, but these only cover stripping out letters or extracting desired content, not removing non-ASCII characters. Does anybody have any recommendations for removing these chars? Many thanks, MonkPaul.
by on Nov 19, 2012 at 08:34 UTC
This is not ASCII, this is real ascii: Otherwise it will trim out newlines and other special characters that are part of the ASCII table!
by (Canon) on Nov 21, 2012 at 14:45 UTC
Correct. 'includes definitions for 128 characters: 33 are non-printing control characters, and 95 printable characters.'
See this 'American Standard Code for Information Interchange (ASCII)' from 1963, the 5th page in particular. This definition is also enshrined in Internet.
by on Nov 21, 2012 at 08:48 UTC
'This is not ASCII'
Sure it is, 32 through 126 (precisely all the characters that aren't 32 through 126).
by (Hermit) on Jun 07, 2007 at 12:36 UTC
Try this:
    $str =~ s/[^!-~\s]//g;
In the above, !-~ is a range which matches all characters between ! and ~. The range is set between ! and ~ because these are the first and last printable, non-space characters in the ASCII table (Alt+033 for ! and Alt+126 for ~ in Windows). As this range does not include whitespace, \s is included separately. \t simply represents a tab character. \s is similar to \t, but the metacharacter \s is shorthand for a whole character class that matches any whitespace character. This includes space, tab, newline and carriage return. Or simply:
    $str =~ s/[^[:ascii:]]//g;
by on Oct 27, 2011 at 06:25 UTC
Cool.
This worked for me.
by (Canon) on Jun 08, 2007 at 10:07 UTC
'This text contains non-ascii characters that I have to remove before I can run it through some text-mining software.'
You don't expect to have to handle any accented characters? Those aren't 'Ascii'.
by on Jun 08, 2007 at 02:26 UTC
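For the original question of cleaning a whole file, the same character set discussed in the replies (printable ASCII plus whitespace) can also be kept from the shell. This is only a sketch, and it swaps in tr rather than the Perl substitution from the thread; the function name strip_non_ascii is made up for illustration.

```shell
#!/bin/sh
# Hypothetical helper (not from the thread): copy stdin to stdout, keeping
# only tab, newline, carriage return and printable ASCII (codes 32-126).
strip_non_ascii() {
    # -c complements the listed set and -d deletes the complement.
    # LC_ALL=C makes tr operate on single bytes instead of locale characters.
    LC_ALL=C tr -cd '\11\12\15\40-\176'
}

# Example input: "café π", then a tab, "x" and a newline (UTF-8 bytes in octal).
printf 'caf\303\251 \317\200\tx\n' | strip_non_ascii
```

To clean a file, something like strip_non_ascii < input.txt > cleaned.txt should work (both file names are placeholders). Note that accented letters are simply dropped, not converted to a plain equivalent, which is the concern raised in the last reply.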
Assuming that 'foreign' means 'not an ASCII character', you can use find with a pattern to find all files whose names contain characters other than printable ASCII:
    LC_ALL=C find . -name '*[! -~]*'
(The space is the first printable ASCII character and ~ is the last.) Setting LC_ALL=C is required (strictly, LC_CTYPE=C and LC_COLLATE=C are what matter), otherwise the character range is interpreted incorrectly. See also the manual page glob(7). Since LC_ALL=C causes find to interpret strings as ASCII, it will print multi-byte characters (such as π) as question marks. To fix this, pipe the output to some program (e.g. cat) or redirect it to a file.
Instead of specifying character ranges, [:print:] can also be used to select 'printable characters'. Be sure to set the C locale, or you get (seemingly) quite arbitrary behavior.
Example:
    $ touch "$(printf '\u03c0')" "$(printf 'x\ty')"
    $ ls -F
    dir/ foo foo.c xrestop-0.4/ xrestop-0.4.tar.gz π
    $ find -name '*[! -~]*'    # this is broken (LC_COLLATE=en_US.UTF-8)
    ./x?y
    ./dir
    ./π
    ... (a lot more)
    ./foo.c
    $ LC_ALL=C find . -name '*[! -~]*'    # multi-byte characters are printed as question marks
    $ LC_ALL=C find . -name '*[! -~]*' | cat
    ./x y
    ./π
    $ LC_ALL=C find . -name '*[![:print:]]*' | cat
    ./x y
    ./π
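The find pattern described above can be wrapped in a small helper and tried in a scratch directory. This is a minimal sketch: the function name find_non_ascii_names and the demo file names are made up for illustration, and the non-ASCII name is written with octal escapes so it does not depend on the shell's printf supporting \u.

```shell
#!/bin/sh
# Hypothetical helper (not from the original answer): list entries under a
# directory whose names contain a byte outside printable ASCII (space to ~).
find_non_ascii_names() {
    # LC_ALL=C makes the bracket range compare single bytes.
    # Piping through cat means find is not writing to a terminal, so it
    # prints the raw bytes instead of question marks.
    LC_ALL=C find "$1" -name '*[! -~]*' | cat
}

# Demo in a temporary directory so no real files are touched.
demo_dir=$(mktemp -d)
touch "$demo_dir/foo.c" \
      "$demo_dir/$(printf '\317\200')" \
      "$demo_dir/$(printf 'x\ty')"
find_non_ascii_names "$demo_dir"
rm -rf "$demo_dir"
```

The demo should list the π file and the file with a tab in its name, but not foo.c. The order of the two lines depends on the file system.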