firefox - Copying text from YouTube to Clipboard introduces dashes?

2014-07
  • sammyg

    Here's an example of a link I found on YouTube in the comments section of a video.

    gnu.org/distros/free-distros.h­tml
    

    This is the way it shows up in the comment.

    If I highlight this link and copy to clipboard (ctrl+c), then go to a new browser tab and paste it (ctrl+v) in the address bar, then this is how it shows up.

    gnu.org/distros/free-distros.h­tml
    

    It looks the same, right? But if I hit Enter I get an error.

    404 - Page Not Found

    The page you were looking for could not be found on the GNU web server.

    If you followed a link that turned out to be broken, and the page with the broken link mentions an explicit address to which to report bugs, please use that address.

    The URL also changes to the following.

    http://www.gnu.org/distros/free-distros.h%C2%ADtml%EF%BB%BF
    

    If I remove %C2%ADtml%EF%BB%BF and type in tml so that I get back the address http://www.gnu.org/distros/free-distros.html and then hit Enter, well now it works, and the page loads.
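    As a minimal sketch of how to see what those percent-escapes stand for, the snippet below decodes the broken URL and names every non-ASCII character in it. It uses only Python's standard library; the helper name `describe_hidden` is made up for this illustration.

```python
import unicodedata
from urllib.parse import unquote


def describe_hidden(url):
    """Percent-decode a URL and list every non-ASCII character in it."""
    decoded = unquote(url)  # %XX escapes are decoded as UTF-8 by default
    return [
        (index, f"U+{ord(ch):04X}", unicodedata.name(ch, "<unnamed>"))
        for index, ch in enumerate(decoded)
        if ord(ch) > 0x7F
    ]


broken = "http://www.gnu.org/distros/free-distros.h%C2%ADtml%EF%BB%BF"
for index, code_point, name in describe_hidden(broken):
    print(index, code_point, name)
```

    Run on the URL above, this should report U+00AD (SOFT HYPHEN) for %C2%AD and U+FEFF (ZERO WIDTH NO-BREAK SPACE) for %EF%BB%BF, and nothing at all for the corrected address.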

    I thought to myself that this is very strange so I tried pasting the same text from clipboard to a plain text editor (notepad) and this is what I got.

    gnu.org/distros/free-distros.h­-tml
    

    How was the dash between h and tml introduced? This is why I was getting the 404 error. But the URL appears correctly when pasted to the address bar. Is this some kind of hidden character perhaps?

    Also, if I go back to YouTube and highlight the link, I can see that there is a bump on the last three letters. The highlighting is taller around "tml". You can see that in the screen capture below.

    screen1

    screen2

    Why is this happening? What's going on? Could it be that Google is somehow intentionally salting the link?

    Update

    If I paste into Notepad++ (version 6.3) I get the following.

    gnu.org/distros/free-distros.h­tml?
    

    If I try to paste into the address bar of the Google Chrome browser, there appears to be some kind of hidden character at the end of the URL. See the screen capture below.

    screen3

    That's not a white space. It's something else... something alien! Something from planet X?

    Note: The vertical line at the end is not the one I mean, that's just the text input cursor blinking.

    Update 2

    Inspecting the html code in Firefox by using the element inspection tool.

    screen4

    Why is there a square within the opening wbr tag?

    Update 3

    The "square" appears to be the soft hyphen character entity. Here follows the actual source code of this particular line.

    <p>gnu.org/distros/free-distros.h<wbr>&shy;tml</p>
    

    The soft hyphen is the &shy; you see here. HTML tags, such as <wbr> or <b> (for bold text), are not selectable. When you highlight text on a web page in a browser, you are not selecting the HTML tags. Nothing within <> is shown.

    So it seems that the soft hyphen is the root cause of the copy and paste issue. It is not displayed on the web page, but it is selected when you highlight the text.
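    To illustrate the point, here is a rough sketch that extracts only the text nodes from that line of markup, which approximates what a text selection contains. It relies on Python's standard `html.parser`; the class and function names are invented for the example, and a real browser's selection logic is certainly more involved.

```python
from html.parser import HTMLParser


class SelectableText(HTMLParser):
    """Collect only the text nodes and ignore the tags themselves."""

    def __init__(self):
        super().__init__(convert_charrefs=True)  # turns &shy; into U+00AD
        self.parts = []

    def handle_data(self, data):
        self.parts.append(data)


def selectable_text(markup):
    parser = SelectableText()
    parser.feed(markup)
    parser.close()
    return "".join(parser.parts)


text = selectable_text("<p>gnu.org/distros/free-distros.h<wbr>&shy;tml</p>")
print(repr(text))
```

    The <wbr> tag disappears from the result, while the &shy; entity survives as a single U+00AD character between "h" and "tml".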

    Update 4

    This is what it looks like when I paste the URL into Microsoft Word 2010 and view hidden characters.

    screen5

    To move the text cursor from .|html to .ht|ml requires pressing the arrow key three times. You can tell by the image above why that is. It's because of this hidden character. With the cursor in front of that strange-looking character, pressing Alt+X shows 0068. With the cursor after that character and in front of the letter t, Alt+X reveals nothing at all. The 0068 is just the Unicode code point for the letter h.

  • Answers
  • barlop

    Yes it is a nuisance.

    There are two hyphens: the normal one, \u002D, and the funny one, \u00AD. The funny one is sometimes used within YouTube comments and comes up as hidden.

    Paste into Notepad (to remove formatting; Notepad also shows it) and then into MS Word (or just do Paste Special > Unformatted Unicode Text in MS Word), put your cursor to the right of the hyphen, or any character, and press ALT-X, and you'll see the ASCII or Unicode code for it.

    This may seem strange. Be aware that a few characters come in two types: one you usually use, within the 0-7F range, and one that people rarely or never use, which is above 7F. There are two types of spaces (the normal one and the non-breaking space, code 160, \u00A0, which can be of use). There are two types of pipes, 7C and A6; the A6 one is just asking for problems, as it causes failures on the command line. And there are two types of hyphens; the second one behaves funny too, as YouTube comments sometimes use it, hide it, and don't display it as a hyphen.
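    If MS Word isn't handy, a small script can play the role of ALT-X. The sketch below prints the code point and official Unicode name of each character in a string, using the pairs mentioned above as sample input. The function name `code_points` is just an illustration.

```python
import unicodedata


def code_points(text):
    """Return (character, code point, Unicode name) for each character."""
    return [
        (ch, f"U+{ord(ch):04X}", unicodedata.name(ch, "<unnamed>"))
        for ch in text
    ]


# Normal hyphen, soft hyphen, space, non-breaking space, pipe, broken bar.
sample = "-\u00ad \u00a0|\u00a6"
for ch, code_point, name in code_points(sample):
    print(code_point, name)
```

    Each lookalike pair then shows up with two different code points, one at or below 7F and one above it.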

    Another funny character I see used by YouTube in comments is \uFEFF. You can run Notepad2 (download it), choose File > Encoding > UTF-8, then paste the text in and search for \uFEFF, replacing it with nothing (check the box that says "Transform backslashes").

    Similarly, you can open Notepad2, search for \u00AD (that funny hyphen) and replace it with a regular hyphen (or, for a URL like this one, with nothing). EditPad free might be able to do it, though I use the Pro version for its regex support.
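    The same search-and-remove step can be sketched in a few lines of Python. This version deletes every character in Unicode's "format" category (Cf), which covers both \u00AD and \uFEFF. That is an assumption that suits URLs; it would also strip characters such as zero-width joiners that some scripts and emoji legitimately need. The function name is hypothetical.

```python
import unicodedata


def strip_invisible(text):
    """Drop soft hyphens, U+FEFF and other Unicode 'format' (Cf) characters."""
    return "".join(ch for ch in text if unicodedata.category(ch) != "Cf")


pasted = "gnu.org/distros/free-distros.h\u00adtml\ufeff"
print(strip_invisible(pasted))  # prints gnu.org/distros/free-distros.html
```

    Note that this deletes the soft hyphen rather than swapping in a regular one, since the original address has no hyphen at that position.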

    I'd note that charmap doesn't copy the funny hyphen correctly (so if you want to experiment, and you copy it from charmap and paste it into a piece of software and it displays funny, blame charmap), but it copies fine (as in, with the character) from your link in my browser (Chrome). It would be better if the character wasn't there, though; it is a nuisance! But you can see its code in MS Word, and you can search for and remove it in Notepad2.

    You can see from charmap that it (\u00AD) is called the "soft hyphen" (I'm just glad they didn't hyphenate that title!).

    In the pic I used Ms Word and did ALT-x

    enter image description here

  • Levans

    Looking at the source code of this portion of the page, I see this:

    <p>gnu.org/distros/free-distros.h<wbr>­tml</p>
    

    It seems that YouTube automatically inserted a <wbr> tag. It's a word-break opportunity: it tells the browser that, if needed, the word may be broken at that point to start a new line.

    On UTF-8 encoded pages, this is displayed as a ZERO-WIDTH SPACE, not showing anything, but allowing a newline. That's what causes your encoding issue.

    It looks like YouTube has an algorithm to auto-insert <wbr> into long words at good places (not cutting a syllable in two), but as the http:// was missing at the beginning of the URL, the algorithm didn't recognize it as a URL, and thus assumed it was a word that could be broken.


  • Related Question

    windows - Any utilities for automatically pasting/assembling copied text from the clipboard?
  • CMPalmer

    I thought about writing a utility to do this, but decided to look around for one first. There are dozens of clipboard managers and clipboard-related programs, but I haven't seen one that does exactly what I want.

    I want a program that will "monitor" the system clipboard and anytime it sees new text appear in the clipboard, it will paste the plain, unformatted text into a window or file.

    I have a collection of documents, some of them PowerPoint presentations, that I need to scrape the text from into a text document. I can select the text, copy, then go into Word and "Paste Special/unformatted text" or just to Notepad and paste, but that requires a lot of keystrokes and application switching.

    What I'd like to be able to do is fire up a utility and let it run in the background then be able to highlight text, press Ctrl-C, and have the utility automatically append the copied text to a text window or a file until I tell it to stop.

    I was thinking of just writing a console app to do this and redirecting its output to a file, but a Window app would be OK as well - just so I can scrape unformatted text from multiple sources and assemble the text into a new document quickly and easily.
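    A rough sketch of what such a console utility might look like, in Python: it polls the clipboard and appends each new piece of text to a file. Everything here is an assumption for illustration. The function names are invented, polling is only one possible design (a real tool could hook clipboard-change notifications instead), and the tkinter-based reader has not been tested against PowerPoint or Word.

```python
import time


def watch_clipboard(read_clipboard, append, polls, interval=0.5):
    """Poll the clipboard and pass each new, non-empty text to `append`."""
    last = None
    for _ in range(polls):
        try:
            text = read_clipboard()
        except Exception:
            text = None  # e.g. the clipboard is empty or holds non-text data
        if text and text != last:
            append(text)
            last = text
        time.sleep(interval)


def tk_reader():
    """Build a plain-text clipboard reader from tkinter (imported lazily)."""
    import tkinter

    root = tkinter.Tk()
    root.withdraw()  # keep the Tk window hidden
    return root.clipboard_get


def append_to_file(path):
    """Return a callback that appends each text, plus a newline, to `path`."""
    def append(text):
        with open(path, "a", encoding="utf-8") as handle:
            handle.write(text + "\n")
    return append
```

    Something like `watch_clipboard(tk_reader(), append_to_file("scraped.txt"), polls=1200)` would then collect copies for about ten minutes. One limitation of comparing against the previous text: copying the same passage twice in a row is recorded only once.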

    Any suggestions or ideas before I go off and spend as much time writing the utility app as it would take to do it by hand?


  • Related Answers
  • CMPalmer

    So far Ditto seems to be working for me, but it's kind of awkward to use. Very non-standard interface, but I think I have a set of options that will work.

    Would still love to see any better suggestions!

  • Phoshi
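    ; AutoHotkey v1. The $ prefix stops "Send ^c" from re-triggering this hotkey.
    $^c::
        clipboard =                  ; empty the clipboard so ClipWait can detect the copy
        Send ^c
        ClipWait, 2                  ; wait up to 2 seconds for text to arrive
        if ErrorLevel                ; nothing was copied
            return
        clipboard = %clipboard%      ; convert the contents to plain text
        FileAppend, %clipboard%`n, c:\File.txt   ; `n keeps each copy on its own line
    return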
    

    This should, I think, perform the task. It might have some trouble with line breaks, but I'm short on time, so sorry :(

    edit: In autohotkey, sorry!

  • 8088

    CLCL is my preferred clipboard manager.
    It also keeps its history on file (without sqlite).

    image

  • Contango

    Macro Express Pro will be able to do this for you.

    It has a very nice graphical language that allows you to drag sequences of commands onto a list.

    I've been using it for about 5 years; it's powerful enough to do just about anything you can imagine, including this task.

    I suggest setting the "activation" to "clipboard", adding a command to copy the contents of the clipboard to a variable, then switching focus to your target program, pasting it in, and switching focus back.

    I've just finished writing a big macro to do the same thing as you're describing: copying text from a web page into a document for safekeeping whenever I hit a hotkey.